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ABSTRACT 



The "semantic integrity" of a data base is said to be violated when the data base 
ceases to represent a legitimate configuration of the application environment it is intended 
to model In the context of the relational data model, it is possible to identify multiple 
levels of semantic integrity information: (1) the description of the domains of the data base, 
as abstract sets of atomic data values (domain definition), (2) the specification of the 
fundamental structure of the relations of the data base (relation structure specification), (3) 
the definition of the abstract operations which are meaningful in terms of the application 
environment (structured operations), and (4) the expression of additional semantic 
information not contained in the structure of die relations nor in the identities of their 
underlying domains (relation constraints). 

A high level, nonprocedural domain definition language facilitates the description of 
domains. Such a language allows the specification of the properties of the values 
constituting a domain, and the action that is to occur if an attempt is made to update a 
column entry such that it does not belong to die underlying domain of that column. The 
specification of relation structure and structured operations can also be accomplished by 
means of high level integrity (sublanguages. 

A relation constraint has three components: (1) the assertion (a predicate on the state 
of the data base or on transitions between data base states), (2) the validity requirement (the 
occasions) at which the assertion must hold), and (3) the violation-action (the action that is 
to occur if the assertion does not hold at a time when it should). Relation constraint 
specification can be related to an expression f ramew or k (classification scheme) which is 
useful for the construction of a relation constraint language and specification methodology. 
Assertions are more than expressions of some relationships among different values in a 
data base; an assertion singles out the data that is constrained, and states the properties 
that this data must possess. A classification is provided of the various predicate types used 
to identify constrained data and to state the properties that they are to possess. 

A semantic integrity subsystem (of a generalized relational data base management 
system) can support the generation and maintenance of integrity specifications, verify that 
these specifications are met by the data base, and lake appropriate action if violations are 
octdccooL 
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L INTRODUCTION 

Rather than just a collection of values, a data base should be a model of some 
application environment When a data base ceases to represent a valid configuration of 
that application environment, the semantic integrity of the data base is violated. The 
purpose of this thesis is to examine the problem of describing and preserving the semantic 
integrity of a data base in the context of a generalized data base system. The general goal 
is to provide a first approximation to a "theory" of semantic integrity (particularly in the 
context of the relational data model), and to provide a basis for a semantic integrity 
specification methodology. This includes an overview of the relevant issues as well as a 
description of a particular approach to the problem, with emphasis on the high level, 
nonprocedural expression of semantic integrity requirements. 

Data base systems (data base management systems) are intended to assume the tasks 
of facilitating data storage, manipulation, and retrieval The data base system should also 
be responsible for maintaining the correctness of the data in a data base, as well as 
providing users with appropriate abstract views of the data. This is particularly important 
for large data bases, as ad hoc and "hand" checking is impractical. 

By way of background, it might be useful to place the notion of semantic integrity in 
perspective, and to better define the meaning of the term as used in this thesis. There are 
a number of ways in which the soundness of data in a data base may be compromised: 

1. The reliability of data may be compromised by errors due to hardware failure, as 
well as those due to failure of the operating system and data base system software. 
Hardware reliability (in the context of data base systems) has been considered 
elsewhere [Fossum 1974, Wilkes 19721 Software reliability is a very prominent 
research concern at present, as exemplified by the work of those concerned with 
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establishing the correctness of programs. In the area of data base systems, 
Hawryszkiewycz and Dennis [Hawryszkiewycz 1972, Hawryszkiewycz 1973] have 
devetoped a formal semantic model of a relational data bale system, defined data 
baac primitive operations in term* of this? model, and proven the correctness of the 
operation definitions Abstract programs). Weber CWeber mm has further developed 
this approach. 

2. The concurrent consistency of data may be violated due to the effects of 
improperly controlled accesses to shared data by multtple concurrent users (processes). 
It is desirable to provide each user with a consistem view of a data base, shielding 
thfaj user from interfering effects due te the atttvtties of o*w while at the 
same time retaining a maximum amount Of legitimate cbhairrent activity. Eswaran, 
Cray, Lorie, and Traiger DEswsimn 197^ have described a high level scheme for 
concurrent consistency control in a relational data base system. Hawryszkiewycz and 
Dennis [Hawryjzkiewra 1972. Hawryszkiewycz WTfl developed a Tower level model of 
concurrent consistency based on a formal semantic model of a relational data base 
system. ' ^ ,: '■•■' ^ l '•■<" 

3. Data security, may be compromised by a failure to pfbpertyiadministratively) 
restrict the wanner h* which a given user may tcce* and manipulate a data base. A 
good deal of pioneering effort in the "a** 0T security itttd protection has been 
accomplished in the context of operating systems. Some »* this work has been 
extended to data base systems, -e.#, *he work of ^ <Srn*i»erlih, Gray, and traiger 
[Chamberun 197S1 andStonebraker »nd Wong [Stortebrtker l*?4tll 

4. The temanttc totetrkv of data U violated when the data bai* ceases to represent a 
legal configuration of the apphcarton environment it is intend^ to model. Semantic 
integrity errors may be Introduced by user error, lack of oiMien*andfng, malice, etc 
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Oriadvertent, improper, or malice In fact, hardware 

and software renabltttiy errors, Con c ur r ent coiii iitengf effd ii r and iecarity errors may 
cause the semantic integrity specifications of a data base tobe violated, for example, 
some user may, because of a failure of the da* base ttcurtty mechanism, make an 
unauthorised change, such as raising hi* own salary from ttQ,eeO lb 190,000: this 
unauthorized c hang e can then caute a semantic integrity constraint' to be violated, 
such as 'all employee salaries are tea than t28j000*. 

This thesis dealt specifically with the fourth aspect of me soundims of a data base, 
namely semantic integrity. In what follows, we assume that hardwire and software 
reliability are guaranteed v*f* by me e«*raung systemX W* »*• assume thee concurrent 
consistency is assured; it ii sufficient to assume, without toss of generality, that a single user 
is interacting with the system at any given time. Security ajsues are not «ui«her owuidered 
in this thesis; - iv;;-;;: 1 )-!.^..^!^,,^. 

LL Semantic Integrity , 

A dam bate it meant to serve as a model erf sorm ttwtied universe, at any given time, 
the values in the data base represent a particular eahf tjnaatson oe that application 
e nvir on me nt Every such world has its own mwwAiogkr a set of rules specifying what 
constitutes a legitimate and plausible configuration of taabenasnwinsent ff k«e^ MfMi It 
should be the function of the data bai* system to insvne that these ruka are not violated 
and therefore that the data hue is not in a iiff»nteniliy4i>o»st»tentttate. 

A basic p remis e wa wiit adopt U that^ as anted bj hlinsky [iilliMky 197eat ''aw 
fundamental property of a data base is that it has an intrinsic meanjnat which 4< invariant 
of its interaction with users* The semantic insterJce soeo tfkBtren» for a data bate capture 
this intrinsic meaning. The dam base system should facn^te the p»ecise eaprewon of 
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the** integrity specifications. We assume that some person (or committee of persons), 
known a* the && && administrator, is responsible for stating the semantic integrity 
specif icatioru for the database. 

It is possible and indeed desirable for the data base system to support multiple 
abstract logical views of a data base These view* mwt however be constructed from and 
consistent with the semantic integrity tpecif ka tioro. (Le, the data base administrator's view 
of the data base). Even providing a vww of the data ba*e which cen«uu of a subset of 
that daw base is difficult, because of the "coneections" between the subset and other 
elements of the data base. 

A variety of causes may nesult hi a compromise of the semantic integrity of a data 
base, including: 

I inaccurate data recording or entry, 

2. inadvertent alteration of data during some transmission or transcription process, 

S. deliberate falsification of data, 

4. loss, omission, or delay of data. 
The ramifications of permitting incorrect data to permeate a data base may indeed be far 
reaching. Crucial decisions may be wrongly influenced, user confidence in the system 
destroyed, and the reliability and performance of the system degraded (including 
application program and packages as weH *t the data base system itself). 

It U generally recognized th»t the problem of bad data In data bases is a serious one. 
Unfortunately, the state of the art in «rror checking in data base systems U quite dismal. 
Most semantic integrity checking U currenUy aewn^hshed by means of application 
programs; data checking mechanisms are embedded m them apphmtkm program*. Special 
purpose data base "audit" rcuttotrsj* alio sometimw used tochtck data integrity. Ex U ting 
commercial data base jystemi perform limited types of integrity checking, If any. This 
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checking Js nearly always ItmUed to simple data format checks, ta mt i<***ft semantic 
integrity information and checking I* uHuUy unsUTicturcd, and is e mb e dd ed in. application 
programs in an ad hoc nunner [Gosden 19741 f u^htonefe, no dit^ne ii impo»a* on the 
semantic integrity specification process; TW* iacJtof jtruaupf an<14i»CipUn«^ha» the 
following consequence* 

L The mechanism by which semantic iateg;fj|y spjcjf trations are chocked Is diffuse. 

2. Semantic integrity specif iations are not readUty nwdif iafefc. * d; 

3. The abstractioiv defied by the wn^ttc integrity specifications, which is intended 
to correspond to the set of rules in the appttcation. environment, U difficult to 
understand. . . «>;<••-*'*•---,, -■>< '.'-■■'• , 

4. Inconsistencies and redundancies can, ; bebpj$iefjl in $he semantic integrity 
specif icaUons, which nuy be difficult to locate. ,...,. nr ^ 

5. It is difficult to make the sernantic^^ntegrity checking process efficient, ofthef by 
means of manual or automatic optimization. ,, ^, = 

1.2. The Data Model .■■■■;«.:-. -x-.-j 

The daft model upon which a data base system it bawd }s defied her* to consist of 
the type(s) of data structures used to represent information^ m the data base, atang with the 
set of primitive operations, which can be ut«d to rrw>ipM^ tiwet^SOfuctures- The nature 
of the data model underh/Aag.a database system ha* a y try significant effect on the 
manrjer in which one describes the. semantic intagf^pf a database m that syjiwn. As 
described below, some semantic integrity information is often in fact embedded in the 
structures used in the dau model [Date l»76, Mommew iWti- ^ , 

There have been three principal data model* proposed for generaiiaed data base 
systems {Date 19751 , 
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1. For historical and other reasons, the hierarchical approach is a very popular one. 
Example* of hierarchical data base systems and data sublanguage* (languages fo r 
defining and manipulating data bases) include IMSf EWM1 HQJL ffehder 19741 Data 
Language tMariHI973) and System 2000 [MR! OTS in the hierarchic approach, 
some semantic integrity information is expressed in the form of one-to-many 
relationships (trees). Thus, one^o-many constraints are expressed b> appropriately 
constructing the data base hierarchy. 

2. The fflWwa* approach is stypif ied by the Oadasyl DBTG proposal [Codasyl 1971a] 
and the work of Bachman {Bachman 197SI An example of a network data base 
system is Adabas {Software AG 19741 In the network data model, some semantic 
integrity information it expressed vi* many-to-mafty relationships; this is done by 
appropriately constructing the network structures of the data base. 

3. The rejattonal approa«h was introduced oy ©odd tCdctd 1970] [Codd 1974a]. 
Examples of relational data base systems and data sublanguages include ALPHA 
[Codd 1971a], INGRES [McDonald 1974a, McDonald 1974b, Held 1975bl MACAIMS 
[Goldstein 19701 Query by Example Eloof 1974, Zloof 1975a, Zloof 1975bX RDM5 
[Steuert WHl R1SS [McLeod 1975], SEQUEL |I0yWl97>&s Chambertin 1974b, 
ChamberUn 19751 and SQUARE [Boyee 1973b. ioyce 19751 In the relaUonal data 
model funcaonaldependencies are normally included in the specification of the basic 
structure of relations. However, as discussed hf section 1.S, these functional 
dependencies may be easily separated from the baskTifruciure of the relations of the 
data base. 

Several (higher level) semantic data modeh have been recently proposed [Chen 
1975, Schmidt 197s, Senko»», Smith 1976; T sichritzis iml These higher level 
models attempt to incorporate more semantic integrity information in the basic 
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structure of a data base. Structures In these data models are intended to represent 
objects, attributes of objects, and relationship* between otfjict* (in the application 
environment). Semantic Operation* On these structures represent legitimate changes in 
the application environment ^. 

It is not the purpose here to analyze these data models In detail, attfcough many 
of the ideas developed herein are quite ctb^f rel&e? to%oflt an semantic data 
models. Rather, and for reasons tol* explainedlater. rtil relational data^mddel Mil 
be used herein, as a basis for the disCutsiitfn of data*baW" semantic Irn-egrity. 
Although the ideas discussed in Hoi thesis art applkabfc to data blse systems in 
general the discussion U couched in terms of th^ 

13. The Relational Data Model 

The relational d*u model "appears to be the simplest data sfrocturi consistent 
With the semantics of information and which provides a maiHtnuiti degree of data 
independence" Boyce t§^ Tn the 

relational approach there exiscran mwrfaceafwhKH th¥tol$of*forniatted data in 
a data bate can be viewed as a collection of nonlfieYirchii^rilatidna of a^Mrted 
degrees on a given collection of simple o^maint (domains #h(^ eieirtems art not 
decomposable as fat as the data basemanfj^ment system i^^focernedV* 

For the purposes of this thesis, a (refittehaft daja b^ f* defined to be a 
collection of normalized relations (relations in first normal form PCodd liWfl, ind a 
collection of domains. (The nftaxtons rpresehf *S WdaStftise^ar* spettfically called 
base relations. ) A domain ban abstract set tfatomttfdata iftil&i HSb iettsV flomains 
are defined independently of relations. A normalized relation 1 maVbl^Viewett as a 
table, wherein each row of the table corresponds to a tuple of th* relation; and the 
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entries in a column belong to the set of values constituting the underlying domain of 
that column. (An entpr is the value in some particular column for a given row of a 
relation.) The domain underlying a column consists of precisely those objects which 
can appear as entries in that column; any value in the underlying domain of a 
column can appear in that column, and every value in the underlying domain is a 
plausible entry in that column. Note that domain and relation names are unique with 
respect to a data base, and that a domain and a relation cannot have the same name. 

Consider, for example, a data base which contains Information about some 
company. Assume that a relation called EMP contain* data on the employees of the 
company. EMP is shown in figure 1-1, described by its table representation. The 
rows of the table correspond to tuples of the relation (records), and the columns 
correspond to instances of particular domain* of the data base. (Loosely speaking, a 
relation corresponds to a "flat" file, a tuple to a record, and a column to a data field.) 

Each data base relation is created by naming the relation and each constituent 
column, and specifying the name of the underlying domain of each column. More 
than one column in, a relation may have the same underlying domain- Column 
names are unique within a relation. Specifyu3g the name of the underlying domain 
of each column defines the set of values from which entries in that column may be 
selected; that is, the set of entries in a column is always a subset of the underlying 
domain of that column. 

Figure 1*2 contains * description of an mmgrie data basew The name of each 
domain and relation of the data base is listed therein (in Upper case characters). For 
each relation, the name of each of its constituent columns is specified (by one upper 
case character followed by lower case characters), as is the underlying domain of each 
column. Relation EMP contains information on the employees of the company, 
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SALES record! information on the suppttes of items tor the company, ORDERS 
records order information, and BUDGET contains the salary budget for each 
department of the company. 

Figure 1-3 contains a list of some example primitive operations which may be 
used to interact with a relational date base. 1t tsi^melf ^ ni add-on to these 
operation*, a high level, nonprocedural query tanguagt is provided C«^-» SE^iEL 
[ChamberUh W4bl Qpfct tHM itobX or <prfny £*ampll C2^of1i75al}. 

The advantages of Hi* relational data inedefkve beef piously elucidated 

■ ■- :.* 4 t . , \* ft f. >' * ■■„' 

[Codd 1974c, Codd 1975b, Date 19741 afid^n not be repeats 

purposes, the foltowmg attributes of the relational ^nl^lfd^ate^noWaignif leant: 

L Aca» paths are rwt apparent in the logical view of dacs. 

2. The data model is conducive to (relatively) nonprocedun selection, query, and 

manipulation languages. 

S. It is possible to cleanly isolate the different levels of semantic integrity in the 

relational data model as discussed in chapter 2.Tdf example, in [llw hierarchical and 

network data models, certain types of inlegriry cowtrliin^aTi d^tfbefately buitt into 

the data structure itself (eg., the owner-coupled set construct in tht network Model). 

The da* base administrator is thus faced w^ probW^^tlpai^^ the semantic 

integrity reo^lranems from ^^ However, in the 

relational data model; Tie data base administrator has only oitt type of structure to 

consider, and a very simple cWdinifc iysie^ ^ columns 

by name and rows by coniendiiy which ft may reteil* any individual item or 

portion of that structure." fjS«el0753r 
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2. SEMANTIC INTEGRITY 

In the context of the relational data model, it it possible to identify four principal 
levels of semantic integrity: 

1. Qoroait! definition i* the description of abstract sets of atomic data values, which 
are to be used to specify the set of values from which entries in columns of relations 
can be selected. This can be accomplished by means of a high level domain 
definition language [McLeod 1976a, McLtod 1976b]. For example, the domain 
SALARY may be defined as consisting of positive integers less than 100,000. 

2. RjlatiOTsmictjire specif ication is the description of the fundamental structure of 
the base relations. This includes naming each constituent column of a relation, and 
stating the underlying domain of that column. 

S. Structured operations are abstract operations, which are meaningful in terms of the 
application environment Structured operations describe data base transactions, and 
are used to capture the conceptual types of manipulations that are meaningful for a 
data base (such as. for the example data base of figure 1-2. an. operation HIRE- 
EMPLOYEE). 

4. The relation constraints level is concerned with relationships among data base 
components. Relation constraints are used to define all additional semantic properties 
of and relationships between the relations of a data base. For example, primary key 
CCodd 1970) (and third normal form [Codd J971b. Qodd 1971c]) specification is 
accomplished by appropriate relation constraints. However, relation constraints go 
far beyond merely supporting functional dependencies; they provide {he capability to 
define a very rich variety of types of data properties. For example, relation 
constraints may disallow inconsistencies between column entries of a single tuple or 
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between a tuple and other tuple* in thewueeor other reiation(«X They may abo 
preclude some global patterns in some set of tt^to ina^»iation>or fiieiiaei bate ua 
who Wv or may disaJlow cemin types of miuirtg data (such as mtotae; tupiw, obsolete 
values, etc.) ,-,--^. ■■" ;, i-'~ 

Before furtoer describing, the approach to ownantic irttegrity wtfich fc taken in this 

thesis, we briefly examine other work that hw been dene tn the aeea of seman t ic integrity 

in data base systems. 

2.1. Background 

In general, there are two ma)or approaches to the ipedftcation of the semantic 
integrity of a data base: ■■•■'■<,;>•. v ; - -wto.:. ■:■■- 

1. In a tttte snapthot approach, rukt are itated that specify which data base states *#* 
permissible (valid states). Die a^ bw system te Tespc*«*b»* ^ 

data base is always in a valid jute. (As discussed tti * law chapter, U may b« 
necessary to attew the data base * temporarily pe^trtrotlgh one or more invalid 
stateej ■-■ ■■''■,, . .«.•■:'"' .'V-;^ '■ '.*■ S'fv.' ■.:■••■•'■ ;-.'.^t;-'- ■■ ^:: ;■-■ - 

2. IB a state transition approach, the set of legal data bue operations is specif ied. 
Depending on the data ba*« state, only ceaain operations (f aUd operations) are 
allowed to be performed on that state. These operatioasare guaranteed to prtterve 
the integrity of the database. . 

A state snapshot approach to dcscetbing the semantic isffisgrfty ipecifkatioru for a 
data base involyea the expression of legiial contfraie^ which ttanh* ^ 
on the state of the data base* Th«te<^«f4JnB IWrA the>«tat»of adatabas«« thojethat 
conform to some expressed limitations; Several aotber» (Boyce WSa, Zswaran 1975, 
Stonebraker 1974c, Stonebraker ld75c, Zioof 19Hb}ha»e discMSied semaatic integrity 
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assertions in the context of the relational data model. Graves [Graves 1976} has also 
considered the problem of semantic integrity. 

More specifically, Boyce and Chamberlin [Boyce rtTSa] introduced the use of 
SEQUEL predicates for expressing integrity assertions. For an operation which makes a 
data base change to be allowed, the predicates must hold on the data base state which 
results as a consequence of the execution of that operation^ Eswaran and Chamberlin 
[Eswaran 1975] have discussed the functional requirements of a semantic integrity subsystem 
and have examined semantic integrity in the context of SEQUEL and System R 
[Chamberlin 1975, Eswaran 1975]. Stonebraker and Wong have considered semantic 
integrity hi terms of the INGRES system and the language QUEL [Stonebraker 1974c], and 
introduced the concept of query modification as a tool for the implementation of a semantic 
integrity subsystem [Stonebraker 1975c]. Consider the following example of query 
modification: a data base operation is attempted which states "increase the salary of each 
employee in the sales department by 10X"; assuming the existence of an integrity assertion 
which states that "each employee salary is lew than 180,000", query modification would 
transform the operation into one which specifies "increase the salary of each employee in 
the sates department by 10*, if that increase results in his salary being Jess than 180,000". 
Zloof has studied the problem of semantic integrity with respect to the expression of 
semantic integrity specifications in Query by Example [Ztoof l&75bl 

In these approaches, facilities are provided to allow the user to state predicates 
(expressed in SEQUEL, QUEL, or Query by Example) which are to hold on the data base. 
Assertions must be satisfied by the result of a database change for that change to be 
allowed. Several significant problems exist with these approaches: 

1. They do not deal with the entire problem of semantic integrity in a relational data 

base, but rather focus primarily on relation constraints. 
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2. They are inadequately flexible with regard to when a#ir|Joj|i are tcj P* checked- 

3. The types of actions possible upon detection of semantic integrity violations are 
limited. 

4. No structure is placed on the semantic integrity ipecificatioriJ; a»»ertion* are 
arbitrary prediates on the state of #ie data base or on tramitions from one data base 
state to another. 

A state transition approach to semantic Injtgr^ spedficaUon consisu ,o| ^escribing 
the set of legal operations which may be performed on a data base. In this approach, the 
user is confined to interacting with the data base b^ means <?f a limited set of options. 
Semantic integrity information is thus procedurally embedded in the operations. This 
approach has been suggested by Minsky Winsky 107^ > MH^y,i83^4l,Kh«.^t)KC.or 
data base systems. Related work in the area of the definition of abstract data types (e.g, 
the work of Liskov and Zilles DLiskov 1S74)) has much in common with this operational 
approach. ,, 

Some of the most significant problems with the stage tranj$on approach are , 

1. Semantic integrity information is embedded, ; lyrt prjocadjrj||4| an s un>txuc|ure4, 
manner, and is consequently hard to modify % and pottigajly redundant, incoruittent, 
and incomplete. ,„,..., ,. 

2. The conceptual semantic model of a data base is difficult to abstract from the 
procedurally embedded semantic integrity information. 

3. It is difficult to verify the correctness of the semantic integrity information, as it is 
scattered through the operations. 

4. It is not always possible to precisely characterise the se| of operations which are 
meaningful for a data base at the time the 4*ta base is created, Qa|a is often kept in 
a data base before uses for it are discovered, or at least before all of its potential uses 
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are discovered; nevertheless, it is often possible to describe the semantic integrity of 
this data by means of properties it must satisfy (eg., assertions which must hold on 
the data}. 

5. Different data base "views" (external schema*) may include very different sets of 
semanticalty meaningful operations, while still couched in terms of a single data base 
schema (conceptual schema). It is difficult to insure the consistency and completeness 
of the semantic integrity cheeking which is performed by the operations in different 
views. 

6. Some data base operations are not meaningful in terms of the semantic integrity of 
a data base, but are nonetheless required in practice (eg, an operation to change a 
person's date of birth, the value of which was originally incorrectly entered into the 
system). 

22. An Approach to Semantic Integrity Specification 

The major goal of this thesis is to provide a first approximation to a "theory" of 
semantic integrity, particularly in the context of the relational data model In so doing, it is 
hoped that a basts for a semantic integrity spedhcarJblh methodology will be developed. 
This methodology should assist in the formulation of the semantic integrity rules of a given 
application environment, and direct the selection of those rules which will constitute the 
semantic integrity specifications of a data base (eg„ in the face of implementation cost 
tradeoffs). 

A semantic integrity subsystem must be capable of performing: 

1. semantic integrity checking (error detection). 

2. semantic integrity violation localization (determining precisely which data values 
are in error). 
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3. semantic integrity violation-action (reporting/response). 
The semantic integrity specification languages) must provide the user with the ability to 
state all information required to perform these tasks. (This includes, of course, a precise 
specification of the semantic integrity rules themselves.) 

Actually, it is desirable not only to encapsulate (in the data base semantic integrity 
specifications) knowledge about the semantic integrity of a data base, but also knowledge 
about how users will interact with the data base. The meaning of a data base includes the 
manner in which users interact with it; semantic integrity and user abstraction are closely 
related issues. 

Some semantic integrity information is best expressed via a state snapshot approach, 
while other information is best expressed in terms of state transitions. The approach 
described in this thesis includes both state snapshot and state transition aspects. 

Basically, then, the approach to semantic integrity taken here has several major 
objectives: 

L It should be possible to express semantic integrity specifications: 

a. on a high level, 

b. declaratively, rather than procedurally, 
c in a structured manner, 

d. abstractly, in a way relevant to the application environment. 

2. These specifications should be 

a. easily modifiable, 

b. nonredundant, 
c consistent, 

d. complete (as a model of the application environment), 

3. Semantic integrity checking should be: 



Semantic Integrity Specification 22 

a. the responsibility of the system (but the system may sometimes need to ask 
for advice from the user), 

b. flexible, allowing appropriate specification of when checking is to be done 
(eg., after primitive data base change, after conceptual transaction, etc), 

c acceptably efficient in terms of the overall performance of the data base 
system. 
4. Semantic integrity violation-action should be 

a. flexible, allowing an appropriate violation-action to be specified (e.g., 
including error reporting, corrective action, etc), 

b. sufficiently "localized" so as not to generate time-consuming, expensive, and 
potentially destructive "side effects". 

The approach to semantic integrity described in this thesis may in fact be viewed as 
a generalized approach to data base design and/or data definition. That is, we are 
attempting to provide a framework by which the data in a data base may be described. 
Additionally, the framework described herein may prove useful as a base language into 
which specifications in terms of a higher level data model (such as those described in 
[Chen 1975, Schmidt 1075. Senko 1975, Smith 1975, TslcWtais 1975$ may be translated. 
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3. DOMAIN DEFINITION 

The purpose of this chapter is to discuss domain definition, one level of semantic 
integrity in the context of the relational data modtl Specifically, the precise definition of 
domains, viewed as sett of atomic data values, is considered. This includes a review of the 
functional requirements for dealing with the problem of domain definition, a discussion 
and evaluation of other work that has been done in the area, and the description of a 
specific solution to the domain definition problem. 

It is important to note that a domain is different from a unary relation. Domains are 
abstract sets of atomic data values, and may in fact contain an Infinite number of elements. 
A relation, by contrast, must contain a finite number of tuple*. Abstractly, relations are 
subject to change (eg, by the addition of new tuples), but domains are changed only when 
the associated abstraction changes. To a crude first approximation, the set of values 
constituting a domain is fixed at the time the data base is defined ("compile time"), while 
the set of tuples in a relation is normally changed during the day-to-day operation of the 
data base system ("run time"). 

Domain semantic integrity errors, i.e, errors which involve the presence of entries in 
some column of a relation which do not belong to the domain underlying that column, 
occur frequently enough to justify a facility to handle them.. Specific experience with a 
particular data base application environment has shown that, for an experimental sample 
of user-data base interactions, a large percentage of errors discovered are domain semantic 
integrity errors [McLeod 19751 
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3.1. Describing Sets of Atomic Data Values 

As discussed in chapter 2, several approaches to semantic integrity for relational data 
bases have been recently presented. As noted in that chapter, all of these approaches 
essentially deal with relation constraints, Le., facilities are provided that allow the user to 
state predicates (expressed in SEQUEL. QjJEL, or Query by Example) which are to hold 
on the data base. 

The requirements of domain definition are not adequately supported in these systems. 
They lack the capability to allow domains to be precisely defined as abstract sets of atomic 
data values. All of these systems allow the data type of each column of a relation (not each 
domain of the data base) to be defined, but the possible types are limited and very 
representation-oriented. It should be possible, for example, to define domains like 
SOCIALJ5ECURITY DUMBER and GEOjCOORDINATE, rather than being limited to 
such domains as INTEGER and GHARAGT£R_STRING: It is desirable to be able to 
describe a conceptual das* of data values. This abstract description is quite different from 
a mere specification of the physical representation of the values in a domain; rather, the 
semantic properties of the domain are pronounced. The work of Liskov and Zilles [Liskov 
1974] concerning abstract data types is related to this notion, in that classes of abstract data 
objects (values) are being described. 

Boyce and Chamberlin [Boyce I97Sa] have proposed attaching attributes to each 
column of a relation ("column descriptors"), One of these attributes is the scope of a 
column, which specifies the set of permissible values for entries in that column, eg., salary 
is a positive integer less than 20000. Similarly, Zloof [Zloof 1975b] has indicated that 
provisions should be made for facilitating the specification of entry "formats" ("their type, 
size, etc,"). 

A detailed scheme it needed to facilitate the precise description of domains, and to 
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integrate the domain definitions with the structure of the tehtfonal data base. Such, a 
scheme shouH (at leait) satisfy the foltowing criteria: 

1. facilitate the precise and detailed description of seu of atomic data valises, as 
subsets of one of the natural domains: ml number and character string (these 
"natural" domains are the primttive domains which ar* used to construct other 
domains), 

2. provide for the proper abstraction of defining domains independent of their use as 
underlying domainaof columns in one or mece relatione, 

3. force a domain definition to be a single module so that domain semantic integrity 
information i* localized, 

4. facilitate automatic domain definition checking and flexible types of action which 
aire to occur upon detection of a domain drfintlon vJoUtkin, 

5. support ipecif faatiem that describe when and how domain values an be compared 
(eg., when two values being compared are from me. saiM domain, and when the* two 
values are from different domains), and converted (eg., when it is desired to convert 
the value in one domain into aid "equivalem" vahie in another doinain). 

3.2. A Domain Definition Language 

A high level, nonprocedural language an be used to express domain definitions. In 
this language, each domain in a data base is described by a single domain definition 
( domain definition module) . The definition of a domain is ?fns*j!lsd* (bound) at the time 
the domain is created. Domain creation may be viewed *s the compilation of the domain 
definition module. Note that a domain definition specif fc* a« underlying set ef atomic 
values. Domains are not dynamic as are unary relations; rather, they constitute fixed 
abstract sets of data values. The definition of a domain may be modified, but this occurs 
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only when the abstraction has changed. 

As noted by Hammer and McLeod [Hammer 1975], three types of information are 
required by the semantic integrity subsystem to deal with domain definitions: 

1. a specification of the set of atomic data values constituting the domain, 

2. information describing when the domain definition is to be checked, 

3. a specification of the action that is to occur if the domain definition is not 
satisfied. 

Since we shall assume that domain definitions are checked whenever an entry in some 
column of a relation is created or altered (e.g., by an operation which inserts or updates a 
row), the specification of when a domain definition is to be checked need not be explicit. 
Thus all that need be explicitly expressed in the statement of a domain definition is the 
precise description of the set of values comprising the domain, and the action that is to 
occur if an entry in some column of a relation is created or modified so that it does not 
belong to the underlying domain of that column. 

Each domain definition therefore consists of the following four components, 
represented as clauses in the domain definition language: 

1. Domain name 

2. Description 

The description clause allows the set of atomic data values constituting a domain to 
be specified. The set of values constituting a domain is defined as some subset of 
one of the two natural domains: real number and character string. Every domain is 
thus defined and represented as a subset of the real numbers or of the set of 
(varying length) character strings. This specification may be accomplished by: 

a. enumerating the domain values, 

b. decomposing the domain values by specifying the subunits of which they 
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are composed, 

c placing restrictions on the set of values by stating predicates that describe a 

subset of one of the natural domains, 
or a combination of the above. The special data value "null" (undefined) is present 
in each domain. This is to allow missing data to be represented in the data base. (It 
may sometimes be useful to distinguish an "unknown" value from a value which 
"does not make sense" DFlorentin 1976], but this distinction is not made here.) 

3. Ordering 

The ordering clause is used to indicate how domain values are ordered with regard 
to comparisons with other values in the same domain. This information is important 
in identifying the semantic properties of a domain. One type of ordering 
specification is that the values in a domain inherit the (total) ordering of the natural 
domain of which the domain is a subset Inherited ordering may also be by subunit 
(e.g., the primary ordering is by one subunit, the secondary ordering by another 
subunit, etc.). Inherited ordering is numeric for domains which are defined as 
subsets of the real numbers and lexicographic for domains which are defined as 
subsets of the character strings. Another type of ordering specification is that no 
ordering exists, in which case only equality comparisons are meaningful. An external 
procedure (i.e., a procedure in some programming language other than the domain 
definition language) can also be used to define the ordering specifications for a 
domain; this procedure is called whenever two values in the domain are to be 
compared. Such a procedure accepts two domain values (which are to be compared) 
and returns the value that is first in the ordering sequence. 

4. Violation-action 

The violation-action clause specifies the action that is to occur if an entry in some 
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column of a relation is created or changed in such a way that the entry does not 
belong to the underlying domain of that column. Types of violation-action include: 

a. the change may be refused and ah error signaled, 

b. a particular value, either constant or calculated from the erroneous value by 
means of operations (such as substring, concatenate, etc) may be substituted as 
the new value of the entry, 

c a call may be made to an external procedure, the erroneous value being 
passed as the argument to the procedure, and the procedure returning the new 
value of the entry.' 
System-generated or user-specified messages may be optionally returned to the user or 
calling program. Note that in cases b and c, it may be necessary to recheck the 
domain definition after the corrected value Of the entry has been determined. 
At this point it should be noted that the use of external procedures for ordering and 
violation-action specification should be minimized, insofar as possible The capability for 
such use of external procedures is provided for generality and completeness. 

3.2.1. Language Details and Examples 

Figure 5-1 contains domain definitions for some of the example data base domains. 
An indentation-oriented syntax is used in this figure. Examples of values in each domain 
are listed (in parentheses) to the right of the corresponding domain definition. 

Figure 3-2 contains a specification of the syntax of the domain definition language. 
In figure 3-2, syntactic classes are denoted by lower case strings, while keywords are in 
upper case, actually, the language should include both upper and lower case keywords. 
Optional parts are enclosed in "□", and alternatives are separated by "j" 

In figure 3-1, the description clause of the NAME domain definition specifies that it 
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consists of (character) strings, each of which is composed of a string followed by a ". ", 
followed by another string. In this description clause, data values are decomposed into 
subunits; the first and third are variable subunits, while the second is constant. Subunits 
may be labeled, so that they may be referenced elsewhere in the domain definition. As 
stated above, external to a domain definition, the data values constituting a domain are 
either atomic numbers or atomic strings. The rule is, if a description clause of a domain 
contains only number subunits (variable or constant), then the values in that domain are 
numbers, otherwise they are strings. Number and string subunits may be mixed, and if so, 
number subunits are converted to string form to yield the string values constituting the 
domain. For example, domain MONEY is defined to consist of strings of the form 
"$25,000". Values in domain MONEY have two subunits, the first of which is the string 
constant "$". and the second of which is a positive number. Values in domain MONEY 
are thus represented as strings; the number subunit of any value in domain MONEY U 
viewed as a number (and can be manipulated as such, eg., by V) when the subunit alone is 
considered, but it is viewed as its string "equivalent" with regard to the domain value as a 
whole (and can be manipulated by string operations). 

The description clause of the domain SEX indicates that it consists of two data 
values: "female" and "male" (in addition to the ever-present "null"). This is an example of 
description by enumeration. 

For domain MONEY, the subunit labeled "value" must be greater than or equal to 
zero, as specified by the subunit where restriction. A subunit where restriction contains a 
predicate that is to be true for the subunit and involves only that subunit; that is, this 
predicate is a restriction on the set of numbers or strings which values for this subunit may 
have. It is thereby possible to express properties of number subunits involving comparators 
(such as "-" and ">") and number constants. It is also possible to state that a number is an 
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exponential (exponential notation) or an integer (at for domain DATE). For string 
subunits* a size (length) specification can be made, the ^ of characters permissible in a 
string can be defined (as for domain ITEM); and a lexicographic ordering comparison 
(such ai "-" or ">") with constants can be stated. 

A global where restriction permit* expression of properties involving multiple 
subunits, as well as those on dfcmain values viewed as a unit A global where restriction 
contains a predicate that inay involve a domain value, subunit values, operations, and 
comparators. String operations can be employed to generate substrings, calculate lengths, 
perform concatenations, etc. Number operations include the usual arithmetic operations 
and "maximum* and "minimum". For example, in the description of domain MONEY, the 
global where restriction states that domain values (viewed as strings) must either have two 
digits to the right of the decimal point or else have no decimal point. Here, "right(*. V ♦ 
I)' evaluates to the right substring of the domain valueXwhich is referenced by V), starting 
at the character after the occurrence of ".*;' tThis form of the "right" operation takes two 
arguments: a string whose right substring is to be calculated, and another string whose 
index in the first string is calculated to determine at which character 6t the first string the 
right substring is tb begin.) The operation "present* yields 4rue" if the first string 
specif ied contains an occurrence of each of the following strings, otherwise it yields "false". 
The global where restriction of domain ITEM illustrates the specification of the number of 
times some contiguous group of subunits can repeat. 

A where restriction may also contain a call of an external boolean procedure (as for 
domain ITEM). If this procedure call is in a global where restriction, the procedure is 
invoked with the domain value in question as Hi argument; the procedure returns "true" if 
the value is present in the domain, otherwise it returns "fa^e^ If the procedure call is in a 
subunit where restriction, the procedure is invoked wimtne subunit value in question as its 



Semantic Integrity Specif ication 31 

argument; it returns "true" if the subunit value U legal, otherwise i^ returns "false". 

Boolean combinations of the above types of where restriction are allowed in both 
subunit and global where restrictions, as are conditionals (as for domain DATE). In 
addition, an "or" may be used to indicate that the domain contains values that come in more 
than one form, Uv, that the domain consists of the union of two or more sets of values, 
each of which is defined separately. 

The second clause in a domain definition is the ordering daute. This may specify 
that no ordering exists on value* in the domain ("none"), which means that only equality 
comparisons are allowed (as for* domain SEX). An ordering specification of "atomic" means 
that values in the domain are ordered by the usual numeric or lexicographic ordering, 
viewing the domain value* as atomic numbers or strings (as for domain QUAN). The 
ordering clause may also contain an ordered list of labels (subunit names), indicating that 
domain values are ordered according to the values of the specified subunit*. The usual 
numeric or lexicographic ordering on these subunits is used, and the sub-units are taken in 
sequence primary ordering, secondary ordering, etc (as for domains NAME, MONEY, 
and DATE). Finally, an external procedure can be tised to specify the ordering on the 
values in a domain. This procedure is .passed the twq values being compared, and returns 
the value that is first in the ordering sequence (as for domain .ITEM). 

The third clause to a domain definition is the violation-action clause. As discussed 
above, it may specify that an error Js to be signaled, indicating that the data base change 
specified by a user is incorrect and should be rejected. A system-generated or user-specified 
message may be optionally returned to the user or calling program. This is also* true for the 
other types of violation-action. If the vioktion-^ion is specif led as "error", then an error 
is signaled and a system-generated message is returned (as for domains NAME and DATE). 
Domain SEX has a violation-action clause that specifies error signaling with a user- 
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specified error message. If a system-generated message were desired the specific message 
could be replaced by "SYSTEM-GENERATED". A system-generated message can be of 
the form "the definition of domain SEX is violated", or can bear more information if the 
system is a bit smarter (eg., "the definition of domain SEX is violated, it consists of only 
the two values 'female' and •mate'"). The "substitute" violation-action allows a constant 
value to be substituted as the new value of the entry being created or changed (as for 
domain MONEY). A calculated value, obtained v& string or number operations, can also 
be substituted (as for domain ITEM). In the specif ication of this calculation, V represents 
the value that is being checked to determine if it is in the domain. The calculated value is 
then checked to make sure that it is in fact a valid domain value; if hot, then an error is 
signaled (to avoid infinite recursion^ The def tnition of domain 0JJAN offers an example 
of an external procedure call violation-action. 

3.3. Implementation Considerations 

The domain definition language processor translates domain def initions into an 
internal form used in semantic integrity checking. Tne semantic integrity subsystem has the 
responsibility of determining what checking is to be done whenever some data base change 
request is issued by a user. It must also assume the responsibility of performing this 
necessary checking. Whenever a new entry is created in a column (e.g., by an insert row 
operation) or an existing entry in some row is changed (eg., 'by an update row operation), 
the system must make sure that this new entry belongs to the underlying domain of the 
column in which it occurs. The information in the description clause of the underlying 
domain of the column is used for this purpose. If the domain description is violated, the 
information in the violation-action clause is used. The ordering information is used when 
comparing two values in the same domain, as discussed in chapter 4. 
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A domain definition may be used to obtain the information necessary to construct 
several internal relations, which are used by the semantic integrity subsystem to facilitate 
domain definition checking: 

L The domain definition relation contains a single tuple for each domain of the data 
base; this relation has the following columns (with primary key domain name): 

a. domain name, 

b. description type, which is "simple" if the domain has one nonlabeled subunit 
with no where restriction, otherwise "complex", 

c global where restriction, 

d. violation-action type, which is "error", "substitute", or "call", 

e. violation-action modifier, which for violation-action type "substitute" is the 
value (constant or calculated) to be substituted, for "calf is the name of the 
external procedure to be ailed, otherwise "null", 

f. error/warning message, which is either a constant (user-specified message), 
"system-generated", or "null", 

g. ordering type, which is "atomic", "none", "subunit" (for subunit specified 
ordering), or "call" (for external procedure all ordering), 

h. ordering procedure name, which is the name of the external ordering 

procedure if the ordering type is "call", otherwise "null". 
Z The subunit definition relation contains a tuple for each subunit of each domain; 
this relation has the following columns (with primary key domain name, subunit 
index): 

a. domain name, 

b. subunit index, which is the ordinal number of the subunit in the domain 
definition. 
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c subunit type, which is either "constant" or "variable", 

d. label, which for constant subunits is "null", 

e. variable subunit class, which is "number", "string", or "oneof", and "null" for 
constant subunits, 

f. subunit where restriction, "null" if none exists, 

g. ordering index, which is the ordinal number of the subunit in the ordering 
clause, and "null" if this subunit is not referenced in the ordering clause. 

3. The oneof constant relation contains a tuple for each constant in a "oneof" 
description of domain values or domain subunit values (for each domain in the data 
base with such a "oneof" description); this relation has the following columns (with 
all columns in the relation as primary key): 

a. domain name, 

b. subunit index, 

c. oneof constant, which is a constant in the "oneof" list for the subunit 
identified by the subunit index (for the domain specified by the domain name). 

Domain definitions may be utilized to automatically determine the appropriate 
physical storage type to be used to represent values in a domain. For strings, a fixed length 
character string representation can be used when possible, such as when domain values are 
enumerated (via "oneof"), or when an upper bound is placed on the length of string values 
in the domain. In other cases, varying length character strings can be used. For numbers, 
it may be necessary in many cases to make a compromise for efficiency. Integers ("number 
where integer") may be represented by a fixed binary storage scheme (e.g., single word 
binary), but it must be clear that this is only an approximation to the domain definition. A 
similar situation exists for real numbers: a float binary representation may be used for 
storage. 
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3.4. Extensions 

Important issues to be considered in future research on domain definition include: 

1. It is possible to extend the domain definition language so that previously defined 
domains may be used as subunits in the definition of a new domain. If this 
hierarchic approach is used, care must be taken by the system to retain domain 
definitions until they are no longer referenced in any other domain definition. 

2. It may be useful to introduce domain operations. In this approach, operations are 
defined for each domain, and manipulation of values to the domain is restricted to 
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the specified operations. This approach is similar to the notion of abstract data types 
of Liskov and Zilles DLiskov 19741 It may be argued that the approach taken in this 
paper is still too representation-oriented. For example, values in the domain 
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MONEY may be strings or numbers, but this is irrelevant with respect to abstraction. 
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The important properties of the values constituting a domain may be best 
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characterized by specifying the operations that are defined on the values in the 

domain. Of course, in this case a domain will no longer be defined as a subset of 

one of the natural domains (string and real number), and the standardized set of 

domain operations (such as V, V, V, etc) will probably no longer be appropriate. 

3. It may be advantageous, in some cases, to defer the checking of domain definitions, 
and not report violations at the time the data is actually entered into the system. For 
example, in the case where a data base is being "bulk loaded" or updates are being 
"batched", it may be desirable to report all violations of domain definitions at a later 
time, say to an interactive user or as part of a summary report. 

4. The modifiability of domain definitions is a very important issue. It should be 
possible for the definition of a domain td be changed as the corresponding 
abstraction changes. If this is allowed, then it is necessary to verify that all entries in 
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columns having a given underlying domain satisfy the new definition of that 
domain. 

5. It is possible to call an external procedure to verify that a value in question belongs 
to a domain. An external procedure call may also be used in the ordering and 
violation-action specifications. However, we have no guarantee that the external 
procedure is correct. Some reliability is nonetheless guaranteed by the fact that this 
external procedure must use the normal data base system interface. In addition, the 
domain definition is again checked after the external procedure has terminated. 

6. The problem of implementing the domain definition scheme and evaluating its 
effectiveness and efficiency has yet to be fully addressed. 

7. It may be useful to consider the automatic generation of domain definitions by 
attempting to generalize upon a few examples of domain values which are given by a 
user. This is, of course, a part of the general problem of the detailed specification of 
the user interface which supports the construction of domain definitions. 
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i. RELATION STRUCTURE 

Relation structure specification U the description ef the fundamental structure of the 
(base) relations of a data base. When a relation is created, at least, the following autsc be 
done: 

1. The relation must be given a name, which is unique with respect to aU names of 
relations in the data base. 

2. The number of columns in the gelation mutt b« specif ied> 

3. Each column of the relation must be wigned a vmtoue i»am« (unique with retpect 
to the names of the columns of thtrtlation). 

4. The name of the underlying domain of each column must be specif ied. A 
definition for each domain thus referenced must exist at the time the relation is 
created. 

It i$ possible to include other types of information as a part of the fundamental 
structure of a relation. For exarapkythe primary k<v [Codd 18701 of the relation may b* 
identified. However, at the level of abstraction at which our discussion of semantic: 
integrity is focused* the identification of the primary key may be viewed as a type of 
relation constraint (and expressed as such). Furthermore, there is no compelling reason for 
distinguishing the primary key from other candidate keys [Codd 19701 It is most logical for 
a primary key specification to be viewed as a relation constraint* as is the case for other 
types of functional dependencies. 

Many higher level semantic models for data bate design and abstraction (data 
definition), e.g., [Smith 1978], conilder certain type* of reUuion constraint* (such as 
functional dependencies) to he special Functional -dependencies are one important type of 
constraint, but there are other types which may be equally important (in some application 
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environment). We believe that it is essential to provide for a broad spectrum of relation 
constraint types, and to integrate the formulation of these constraints with the process of 
data base design and abstraction. In chapter 6, our approach to rthtion constraints is 
further discussed. 

4J. Additional Column Information 

In addition to the column name and the name of its underlying domain, it is useful 
in practice to allow two additional attributes to fce SMdcfetttf with each column: 

1. a narrative detcription bf the column, for docdmentation purposw, 

2. an indicator specifying whether "nulT (undefined^ values may be present in the 
column (thus allowing "null" vahles to be lekctiwHy prohibited from columns). 

4.2. Comparability 

The kinds of comparisons and manipulations of column entries that are allowed 
relates to the semantic integrity requirements of a data baw. The term comparability is 
used herein to refer to the general problem of determining when arid how two or more 
column entries may be compared or otherwise manipulated by structured operations. There 
are two basic types of comparisons: intradomain comparisons and interdomain 
comparisons. 

Intradomain comparisons are those in which two values from the same domain are 
compared. In this case, the information in the ordering clause of the domain definition is 
sufficient to determine how the comparison Is to be made. 

Interdomain tomparisons are those in Which two values from different domains are 
compared. In this ease, values are compared as atomic string* or numbers using a domain 
conversion, as defined below. 
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4.2J. Domain Conversions 

A data base has associated with it a set of domain conversions. Each domain 
conversion is specified by means of a domain conversion modiB*. Each such conversion is 
a specif ication of how values in a given domain ,ar§ converted into "equivalent" values in 
another domain, and vice versa. Explicit specif iation of domain conversions is necessary 
because values in different domains belong to dif f ««ot abstraa seu, and converting a 
value in one domain into an "equivalent" value in another requires knowledge «f the 
precise nature of the abstract sett corresponding to the two domains Involved. For example, 
both FEET and INCHES are numbers, but they cannot be meaningf uUy added without the 
use of an appropriate conversion. 

Domain conversions are defined independent of the domains (and relations) of a 
data base, in the sense that domain conversion module* have no access to the internal 
details of a domain definition; domain conversions thus map atomic values in one domain 
into atomic va|ues in another. Domain conversion moo>Jei can be dynamically created, 
deleted, and modified, with the restrictions that: 

1. both domains referenced in a domain conversion module <nust exist at the time the 
conversion is created, 

2. if either of the domains referenced in «(he domain conversion is deleted, tl»# 
domain conversion is deleted. 

For the purposes erf this thesis, it is assumed that domain conversion modules are 
written in some high level programming language^ TJiis language may be a specialized 
one, similar to the domain definition language. For generality, it is permissible to allow this 
language to invoke external procedures written in a nigji level general purpose 
programming language 

For example, a conversion for domain DOLLARS and 
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THOUSANDS_.OFJ0Oi.LARS can be defined as: 

domain conversion DOLLARS, THOUSANDS OF JDOLLARS 
DOLLARS - THOUSANDS_OFJX>LLARS • 1000 
THOUS«4DSJ3FjSOLLARS * DOLLARS 1 l606 

Conversions may be unidirectional as weft as bidirectional, and this is the reason for 

the seemingly redundant specif fcation hi the above example: for more complex types of 

conversions, external procedure! m*y be used; for example, we may have: 

domain conversion DATE, JULIANJ>ATE 
DATE - ptnULIAN JDATE) 
JULIANJ5ATE -p?(DATEJ 

where pi and p? are external procedures. 

Structured operations may perform various types of domain comparability operations 
on entries in a data base. The standardized set of such domain operations includes "-", 
W. V, ">»-, V, "<-", V,-v V, T. W, and string and user-defined operations. For 
example, some structured operation may check to See If, for some tuple in relation R, the 
entry hi column A islarger than then entry in column fc. ^t is assumed tha ooth columns 
A and B contain numbers.) 

Whether or not value* from different domains may be utilized together (compared or 
otherwise manipulated) depends upon the nature of the domains and the particular type of 
operation that is to be performed on the* values in those domains. In order to establish a 
first approximation to a set of comparability rules (for the standardized set of domain 
operations), three types of comparability are distinguished: 

1. equality-type, which is invoked when one of the following types of manipulations 
■ occurs: 

a. values are compared for equality ("-") or inequality fW). 

b. numbers are added ("♦") or subtracted ("-"), 

c sets of numbers are manipulated via set operations, such as "maximum" and 
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"minimum", 

d. sct» of values are manipulated by "union", "intersection", or "difference", 
2. ordering-type, which is invoked when values art compared via V, "<-". ">", or 

> m , 

S, mixed-type, which is invoked when values are manipulated via multiplication <"*"). 

division (7"X exponentiation (W), or any wring operation or uter-defmed operation. 

Equality-type comparisont are always allowed if the ;tw§ivaloe* being compared (or 
manipulated) are from the same domain, le, tf the values are from At same column or 
from columns with the same underlying domain. If *he values are not, from the same 
domain, Le* they are from distinct column* with different underlying domains, then they 
may be compared if and only If * domain conversion ex4stt befweefMhote domains. (All 
domain conversions roust be explicitly defined.) The domain conversion is used to convert 
the value in one of the domains into an "equivalent" value in -theother domain, and the 
resulting values are then compared. (Another type of conversion could be supported, by 
assigning units to each column, and clef Ining uniu converskxu [McUod 1976bl) 

Ordering-type comparisons are allowed if two values ant f roe* the same underlying 
domain and the ordering of that domain is not "none". The ordering information In the 
domain definition is used to determine how the values are to be compared. Orderlr>g-type 
comparison! are also allowed if the two values are from different a^mnx, these columns 
have different underlying domains, and a domain conversien exists between those two 
underlying domains. In thjf case, the values are compared by using fhe domain conversion, 
as for equality-type comparisons, In any other case, wderiog-type cqenparisons are not 
allowed. 

Mixed-type comparisons are always altowed. Values can always be manipulated by a 
mixed-type operation (with no restrictions). Values that ere numbers may be multiplied, 
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divided, and exponentiated with no limitations, except of course for the requirement that 
the values be lumbers. Although nambm may be added* and subtracted onlf when they 
have the same "units", multiplication, division, and exponentiation can be performed 
without any such restriction. It presumably makes sense to divide a value in domain FEET 
by a value in doinairPQUNZ$, but ft is (nw^ values. 

For mixed-type eompatfecfts, laities being iTftniptfetett are treated as atomic and domain 
conversions are not used. Note that if user-defined domain operations are avowed, they 
may be placed in this category by default More ge^Wrafly^ ft mafbe best to allow the user 
to specify the comparability type (equality, ordering;;©* rifiied) of each user-defined domain 
operation. ~ 

If the user wishes to sate an unusual type of query, such as asking for all employees 
whose name is the same as the name of their department; the user miey be allowed to •force" 
the comparison, by explicitly overriding the restrictions. Entries in the two columns are 
then compared using the default numeric or lexicographic ordering, treating the values as 
atomic numbers or strings, respectively. The idea is to permit the system to be flexible and 
not to allow comparability rules to get In the way when they shtiaKl not. The best approach 
may be to warn the user that an operation may be meaningless, btit allow it to proceed if he 
demands it (The semantic integrity of the data bake is not really in danger anyway). 

Domain conversions are also useful when a structured operation retrieves an entry 
from some column of a tuple in a relation and assigns irto Be the new value of some other 
entry (in a different column of some tuple in a relation). For example, suppose that the 
date an item was shipped by some company (the entry in column D§te of relation ORDERS 
in the example data base of figure 1-2) is to be copied into the Date column of another 
relation, say BIG_ORDERS. (BIG.ORDERS records all orders which request over $1000 
of merchandise.) The Date column in BIG.ORDERS has underlying domain 
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JULIAhLPATE (i.e^ dates of the form "76.134"), while the Dattoriumn in ORDERS has 
underlying domain DATE (Lt, dates of the form "1/20/1976"). Thus the domain conversion 
from DATE to JULIANJDATE can be used to effect the desired assignment. 

The general rule for an assignment which tak.es the entry in a column (A) and 
assigns it as the new value of an entry in another column (B) is as follows: 

L If A and B have the same underlying domain, the assignment is performed with no 

conversion. 

2. If A and B have different underlying domains, then: 

a. if a domain conversion exists from A to B, the conversion is used tt> affect 
the assignment, 

b. if no such conversion exists, the assignment is not allowed. 
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5. STRUCTURED OPERATIONS 

A very important aspect of data base semantic integrity is the set of operations a user 
may employ to examine and manipulate the data base. It is possible to describe a user's 
view of a data base as consisting of data structures plus operations. Alternatively, one may 
conceptually characterize the user's abstract view completely by a set of abstract operations, 
as is done in abstract data types [Liskov 1974]. These operations provide a behavioral 
specification of the semantics of the data base. 

For these reasons, the concept of a structured operation is included in our approach 
to semantic integrity. The principal purpose of a structured operation is to embody a 
conceptual data base transaction: an action which is meaningful and permissible in the 
context of the application environment. For the example data base of figure 1-2, structured 
operations may include: hire_employee, fire_employee, raise.salary, place_order, 
create_new_department, etc 

5.1. Semantic Integrity Information in Structured Operations 

One approach to preserving the semantic integrity of a data base is impose the 
restriction that the operations that may be performed on a data base are only those in some 
given set. This set of operations should be defined so that it contains only meaningful 
actions. However, the approach of allowing only semantically meaningful operations has 
several problems: 

1. Operations which are not semantically meaningful in the context of the application 
environment must be allowed, e.g., to permit errors to be corrected. 

2. The set of operations that are to be allowed may depend upon some characteristics 
of the data base state. For example, the set of operations Ol may be legal if the data 
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base is in state Si, but if the data base is in state S2» the set of legal operations may 

be02. 

S. The uses of a data base are not fixed, but rather eVofve with time. Operations 

change and new operations need to be created. If Ate* semantic integrity information 

is embedded in these operations, a scan of alt data base operations may be necessary 

to make such modifications. 

4. Often data is maintained in a data base before uses for it are discovered, ffhus it 

is difficult to characterize the data via a behavioral semantics approach; in some 

sense the semantics of the data is known, but the exaeJMii^lt^'^^'seti^.^i^ions 

on that data is not 

5.2. The Definition of Structured Operations 

Despite the problem* mentioned above, it U important to be able to define a set of 
abstract operations on a data base. To this end, we allow structured operations to be 
defined. Structured ojieragons ..are constructed using: 

1. the primitive data base operations (e.g., see figure 1-SV 

2. statements in a very high level data selection (query) and data modification 
language, such as SEQUEL (or QJJEL or Query by Example). 

Structured operations are ordered lists of: primitive operations, statements in a data 
selection and modification language, and previously defined structured operations. 
Allowing previously defined structured operations within new "operations enables a 
hierarchic organization. 

For the example data base of figure 1-2, a structured" operation to raise an employee's 
salary could be defined: 
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operation raise_salary (employee jiame, new_salary) 
update EMP 

where Name - employeejiame 
Salary - new_salary 

This structured operation consists of a single SEQJJEL-like statement, which updates the 

Salary column of the tuple in EMP with a value in the Name column equal to the first 

parameter of the operation (presumably there is one such tuple). The new Salary value is 

specified as the second parameter. 

Consider an operation to place an order (again in the context of the example data 

base of figure 1-2): 

operation place_order (customerjd, itemjd) 
insertjuple (ORDERS) 
Item - itemjd 
Customer - customerjd 
Date_shipped - date() 
Orderjnumber - generate_order_number() 

In this example operation, a tuple consisting of all null values is first created, and then its 

columns are given values. Note that two external procedures are called, one to return the 

current date and the other to generate a unique order number. The types of names 

(identifiers) used in the definition of the operation include those of parameters, a relation, 

columns, and external procedures. 

The operation check_creditjand_order could be defined as: 

operation checkj:redit_and_order (customerjd, itemjd) 
if check_credit (customerjd) 

then place_order (customer, item) 
else error 

The operations check_credit and place_order used in this definition are assumed to have 

been previously defined. Note that this operation contains a conditional expression: a 

useful construct we may include in the structured operation language. This of course 

motivates the need for other types of constructs, e.g., for iteration. We may for instance 

want to have an operation that takes an arbitrary number of items as parameters and 
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places an order for each. 

Thus, in general, it might be desirable to have a structured operation language which 
has many of the capabilities of a general purpose programming language; We could 
consequently allow structured operations to be wriatn in some high level genera! purpose 
programming language. The details of this are not persued here. 

One important point to note in pasting, U that structured operations are important 
with regard to the specification ef when relation constraint assertions are to hold (be 
checked). This is further di scu ss ed in chapter 6. 
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8. RELATION CONSTRAINTS 

The fourth aspect of semantic integrity in a relational data base system concerns 
relation constraints. In this chapter, the requirements for relation constraints are detailed, 
and an approach to their specification is presented. 

Codd ECodd WHbf&xM m&hH identified, the "third normal form" of relations 
ECodd 1974a]: "A relation & is in third normal form If itiis fin flat normal form and. for 
every attribute collection C of R, if any attribute not in C is functionally dependent on C. 
then all attributes in R are functionally dependent on C." Third normal form facilitates the 
straightforward expression of some types of relation constraints, namely functional 
dependencies. But the class of data properties describable via functional dependencies is 
limited. 

Boyce and Chamberlin [Boyce 1973a] observed that a high level language, such as 
SEQUEL [Chamberlin 1974b, Chamberlin 19751 may be used as a vehicle for the expression 
of data properties other than functional dependencies. SEQUEL expressions were shown 
to be useful in expressing such types of properties as "uniqueness of key", "functional 
dependency", "validity check", and "inter-relational constraints". 

The integrity assertions of SEQUEL [Boyce 1973a. Eswaran 1975], INGRES 
[Stonebraker 1974& and Query by Example [Ztoof 1975b] are used to express varied types 
of data properties. However, these facilities basically provide for the unstructured 
specification of arbitrary predicates. Although the assertion expression capabilities of 
SEQUEL and INGRES are "complete", they do not allow for the analysis of the types of 
possible assertions. 

Furthermore, the assertions of SEQUEL and INGRES are rather inflexible with 
regard to when they are to hold, and what action Is to occur if they do not. In SEQUEL 
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and INGRES, if a data base change is specified which would cause some assertion to be 
violated, the data base change is immediately rejected and an em>r signaled CEwaran 19753, 
or the data base change is modified such that the assertion will be satUfied (Stonebraker 
W75cl 

In response to this Utter objection, a relation constraint is herein defined as an 
abstract statement, having three components: 

1. the assertion (a property), which is a predicate oo the state of the data base or on 

transitions between data base states, 

2- the yaJ^iE reouhremedL which specifies thfroecasionU) at which the assertion is to 

hold, 

3. the violation-action, which is the action that is to occur if the assertion is not 

satisfied at a time when it should be. 

In response to the former objection, a deuiled classification of relation constraints is 
presented below. The emphasis is placed on providing a structured framework, which may 
be used to construct a high level, abstraction-baaed, weU-directed, and dUcipiined relation 
constraint specification methodology. In so doing, a principal' goal is to impose some 
structure on the problem of semantic errors in data bases. In this approach, iris important 
to keep "an eye toward implementation", although no specific implementation considerations 
are included in this thesis. 

6JL Whither Assertion Structure? 

We subscribe to the view -.that the assertion component of a data tease relation 
constraint should not be viewed at in arbitrary predicate of the firtt-onaer predicate 
calculus, ranging over tuples of the relations of a data base. Rather, every assertion should 
haVe a well-defined, uniform structure. There are several advantages to taking a 
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disciplined approach to assertion expression: 

1. It provide* the data base administrator (or other authority responsible for 
expressing the constraints) with a conceptual framewor* tn terms of which to 
organize his thinking and structure the formulation of assertion specifications. 
Reducing abstract, proWera-oriented limitations on configurations of the application 
environment to concrete restrictions on vatoes in the data base is essentially a 
programming problem. By providing the >^ra*nrher* with a theoretical and 
general framework for his problem, it is poufbie to flgntf fcantly ease his task. 

2. The issues of constraint specification which : -a*e endttary to assertion expression, 
namely the validity requirement and violation-action, cannot be satisfactorily 
addressed in the absence of the kind of structure" proposed herein. The degree to 
which a semantic integrity subsystem can respond "inteiHgently" to a constraint 
violation depends upon how weH the f ormulatfefl of the constraint captures the intent 
of its ex pressor. ■- ..■■■?■- -.i; : . t -r. 

S. A useful conceptual framework for auerttons w« provide some measure of the 
complexity of individual assarttont, providing th«r e*pressor with a guide to the cost 
of their impkroentation. Indeed, the structure of tfi assertion can be used by an 
implementation facility as a guide to the strawgy for the impfemerrtation of its 
checking. 

It is important to note that insuring that there is a single, unique specification of a 
given conceptual constraint is not a major objective here. ttatfcer, the emphasis is placed 
on encouraging a "reasonable" formulation, one which accurately models the application 
environment abstraction and which U useable by an im phm en tation fadHty. 
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6.2. Relation Constnint Assertions 

The auertion component of a relation constraint 4* a logical predicate on the state of 
the data base or transitions between data base states. » expresses sort* semantic property of 
the data base. 

Each assertion is either a simple assertion or a combination of simple assertions (a 
derived assertion) . Simple assertions may be combined using boolean operators and other 
connectors (such as "M then else"). The remainder of this section "deals with simple 
assertions; the generalisation to derived assertions is moi » or leu straightforward. When 
no ambiguity is possible, "auertion" will boused in plaot of "simple auertion". 

6.24. Simple Assertions 

Every (simple) assertion may be viewed as delimiting certain values of the data base 
in terms of certain others. That is, an assertion does not merely express some relationship 
among different values In the data base. Rather, it tingles out certain values, and identifies 
them as being the constrained data of the predicate. The predicate deSrhits the legal values 
of the constrained data in terms of the constraining data. Thus, every assertion constrains 
some data with respect to some other; the two are not being bilaterally restricted. 

As a consequence, there are two dirtinct steps m the proem of staffing an assertion: 
1. The data that is being constrained is described. This description is accomplished 
in two sequential substeps, in which the feHowmg are identified: 

a. the set of all data objects In the data base that are being restricted (the 
constrained collection) . 

b. the precise aspect of each of these data objects that is being delimited (the 
restricted expression) . 

Part a of step 1 utilizes data selection predicates. The predicate expression 
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capabilities of any data selection or query language may be adapted 10 accomplish 
this task [Chamberlin J®#h CbamberKn HflkjCodd/ IdTio, Codd &&&, Hall 1975. 
McLeod 1976c, Held I9#>h, Zloof 1974, Zloof t97fcd For example, consider the 
assertion that the salary of each employee in the sates department is lest than the 
salary of bis manager. Here, the constrained oolieaion consisu of those tuples in 
relation EM P whkh have "saiet" in t*»e Department cetomn. The restricted 
expression i**h^Salary;,e«try of each such iupte. The necessity of fir* identifying 
the constrained collection and then the restricted express^ is occasioned by more 
rich and complex assertions, as dfectfsse&bejom ■/■<*.■ « s s 

2. The actual predicate of the assertion is stated, which asserts a restriction on the 
value of the restricted expression for each member of the constrained collection. The 
predicates used therein are called tuwttan medi cates. In general, this restriction 
depends on other data in the data Jfease, The other data, which participates in the 
assertion is called the, constraining dam, and the expression which computes the 
precise delimit^ value, is called the, rettrictimr exf»os>i oft. for example, for the 
assertion above, the constraining data (for each ttipk) is the, tuple in relation EMP 
whose Name fntry «HjuaU |he Manager entrj of the ccrmrained tuple; the restricting 
expression is the Salary entry of the constraining tuple. 

Figure 6-1 contains some example* of simple assertions. For each assertion, the 
constrained collection and assertion predicate are identified. Note that the "language" used 
to specify the assertion predicates is intended only to be illustrative, but is more-or-less 
consistent with the "level" of (and directly translatable into) relational data selection 
languages such as SEQUEL* &UEU and Query by Example. 
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6.2.2. Identification of the Constrained Collection 

As introduced above, the first step in thf specification of an assertion is the 
identification of the constrained collection: that which U conceptually being delimited by 
the assertion. In general, the constrained collection is a collection, of data objects, and the 
assertion applies to each of them. In this sense, every assertion is in effect an assertion 
schema, which is instantiated for each element of the constrained collection. 

An assertion may either express a property of an individual tuple (a tuple a ssertion) . 
or a property of a set of tuples considered as a whole (a set assertio n). In figure 6-1. 
examples 1-4 are tuple assertions, while examples Wf are set assertions. 

The constrained collection for a tuple assertion is a collection of tuples, to each of 
which the assertion applies. The constrained collection for a set assertion, similarly, is a 
collection of sea of tuples. The set assertion appUes to each tuple set in the constrained 
collection. An important (and frequent) special case, of a.*et assertion is that in which the 
constrained collection consists of a single set Note the difference between this special cast 
and a tuple assertion: in the former, the assertion, applies to the tuple set as.a whole, w hile 
in the latter it applies to each individual member of it. Thus, in example l the corutrained 
collection has many elements* each of which i* a tuple of .the EMf re lotion; In example 5, 
the constrained collection consist* of a single element, which %%^entire. EM?, Relation; in 
example 6, the constrained collection has several elements, .each of which is a subset of the 
EMP relation. 

Both for tuple and set assertion j, dtfining the constrained colleaion begins with 
identifying some set of tuples (called the ^trlyir^g refef|oq of the assertion). This tuple 
set can then be manipulated by means of data selection i predjca^ tn^^ 
constrained collection. 

The underlying relation of an assertion need not be a relation defined as part of the 
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data base. In general , it may be any of the following: 

1. a base relation (a relation explicitly present in the set of data base relations), 

2. the Cross product of two or more base relations, 

3. the union of two or more base relations, 

4. the cross product of two of m^ at least one of which is 
not a base relation, 

5. any rektton *hiefi can be defined in Kims of base relations, not included in the 
above (these relations may be constructed using the various selection criteria and 
retrieval operators of a da^ selecifeh teuigua^l. ; ' 

For example. EMP iTi teuton of tyjie tfEMP crols SJ fiitll>6Et is bf type 2. An example of 
a relation of Kype 3 wodtd* be the infbn of mtm Mit<rfrj.UP and OLDJEMP 
(where both htfve th^stml strtfttUri ai IM^P^'MmpiW r relation of type 5 is 
SALJTOTAt tDep*rnnent; Sum salaries of 

empl6yee*wbrRif%forth«as^fateddep4r^ ,,.?>* 

The foregoing classification of underlying relations Irlh order of increasing 
complexity, and exhnWn%*lfftrent%md* of fi&^^ttfitniclr assertions may apply. It 
is important tb cfcserv* that an SSiertioH need dot Sppty'id I rflat&n explicitly present in 
the data base; but maf hofcf for %' derived ¥eta1f6hf * ! s * T 

Once the underlying relation is defined, the precise specif ication of the constrained 
collection can be aceomptbned. In the^rt bt tu^M#«lfhs; the constrained collection is 
obtained from th# tmdetlyin^ Wkflon *y; hearts "of *&)» Msetibn predicates. The 
complexity of the seleoiort ptotew *ah be described intermi V fhe operators i of the data 
selection language. Selection of the constrained collection Is a problem tri'the specification 
of a relation. 

However, in the case of set assertions, there is a need to specify a collection of tuple 
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sets; each such set is a member of the constrained collection. For illustration, consider the 
following tentative taxonomy of the first stage of the specification proceu for a constrained 
collection whfch consists of tuple sets: 

L The constrained collection may contain a single set of tuples, selected from the 

underlying relaUon. (simp le set) 

2. A set of tuples may be selected from the underlying relation, and then divided into 
groups, eg., by common value in one or more columns or by; intervals of column 
values (such as 7X < Age < 30, 31 < Age < 40, etc). Certain of these groups may then 
be chosen based on properties they possess. The constrained collection is thus a 
collection of tuple sets, namely the groups that were so chosen. The assertion then 
applies to each tuple set in the constrained collection, (grouped sej) 

3. A set of tuples may be selected from the underlying relation, and those subsets of it 
which satisfy a specified property are chosen. An example of, such a property might 
be that the number of tuples in the subset eo^iaU three. These chosen subsets 
comprise the constrained collection, and the assertion is applied to each of them. 
(proRerty-def jne^ sej) 

There is a noticeable degree of flexibility in the foregoing framework for identifying 
the constrained collection, in that it does not impose a rigid specification methodology on 
the expressor of assertions. The criterion of completeness would not demand all the options 
for the underlying relation allowed above; it is clear that any assertion can be satisfactorily 
specified by letting the underlying relation be the cross product of all the base relations and 
performing various operations thereon to compute the constrained collection. However, in 
many instances such an "alfcat-once" approach would be cumbersome and unnatural. It 
might be more convenient to follow a " top- dov qo" . step- bv-st ep approach and define a 
sequence of derived relations, the last of which is the underlying relation. This can 
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facilitate the straightforward expression of the assertion. 

Consider the following assertion: th« sum of salaries of employees of each 
department is less than the budget of that department Ah ail-at-0tt& approach to 
expressing this assertion would proceed to identify the constrained cofoectldri as the set of 
tuples in EMP, grouped by common Department (grou^s^-The restricted expression 
would be the sum of the Salaries (for each group): The isJeVtldn predicate is then 
"sum(Salary) < B13DGET. f Satary_budget whit* BUDGET. Department - 
commo^_valtt«j)f (Department) (In the cbnsfrainertapfc lit)*. Ifous tto cbristfaining data 
is the tuple in BUilGET having thtDep^rtmeftticolOrni^' to the common value 

of the entries in the Department columntor <th*cdfts^heo^tuf8s sit, and th* restricting 
expression is the Salary J>udg«t column entry of the CciWateg&ipleT 

A top-ddwrt, st£p-fty*step approach to thfixprMol or* the abov* assertion may 
proceed by nothig that thi assertion could be expr*lsedl^ 

a relation of the form DEPARTMENTS U^ptrtmWt; SumLof;jsmp_saiaries, 
Salary.budget). If such a delation exTfced, the t^strainelic^lectt would be each tuple in 
relation DEPARTMENTS. The restricted expression would b* the column entry 
Sum_of_ertpjalarte*; Th# asietllbn ^ < 

Salary Jwdget*. Heretr«*re*riclifig expression H m cdlumh r entry Sa^Jbriitfget in the 
constrained tuples and the comtrata^ 

However; the urtatlbh DEPARTMENTS does h<£ ^^Wisl ^se^eritly; W is necessary 
to specify how it is to be dertvedfrom existing base f&ftohs. f^e^rtdertying relation of 
the constrained collection is thus a derived relation, l.e; title relation DEPARTMENTS. A 
data selection fehguagW would be used to construct this derive* relation; for example, the 
specification could bent a SEQUEL-like language: 
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DEPARTMENTS (Department, Sum_of_emp_salaries, SalaryJ>udget) . 
select EMP.Department, sum(EMP.Salary), BUDGET.SalaryJ>udget 
from EMP, BUDGET 

where EMP.Department - BUDGET.Department 
group by EMP.Department 

6.25. Tuple Assertions 

It is now appropriate to examine more closely the structure of tuple assertions. In this 
case, the constrained collection is a collection of tuples, obtained from the underlying 
relation by the application of data selection predicates. The assertion predicate then applies 
to each individual tuple in the constrained collection. Tuple predicates are used to specify 
tuple assertions. The restricted expression defines that aspect of each constrained tuple that 
is being delimited. In the simplest case, the restricted expression is some column name of 
the underlying relation. More generally, it may be an expression: an appropriate 
combination of column names, system-provided operators, and user-defined operators. 

It may be possible to formulate a given conceptual assertion in different ways, with 
different restricted expressions. For example, though the tuple assertions "Creditjine - 
Debt < 50000" and "Creditjine < Debt ♦ 50000" are logically equivalent, in the former case 
the restricted expression is "Creditjine - Debt", while in the latter case it is just 
"Creditjine". This flexibility enables the assertion expressor to precisely identify which 
data values are to be regarded as dominant, and which as subordinate. In the first case, it 
is a combination of the entries Creditjine and Debt that is being delimited, while in the 
latter case it is simply the Creditjine entry. This distinction contributes to the abstraction 
power of assertion expression, and has implications for the implementation of constraints 
and for the actions that are to be taken upon the detection of an assertion violation. 

The value which delimits the restricted expression is the restricting expression, which 
is computed from some data values which may reside anywhere in the data base. In 
particular, these data values (the constraining data) may be outside the constrained tuple. 
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Tuple predicates may be classified on the basis of the relationship between the 
constrained collection and the constraining data: 

L A tuple predicate is tocaKD if the constraining data is present in the constrained 
tuple. That is, for a local tuple predicate, all data referenced in the predicate is 
within the constrained tuple itself. 

2. A tuple predicate is nonlocal independent (NI) if the constraining data is data 
selected from elsewhere in the data base, but whose selection does not depend on any 
data in the constrained tuple. 

3. A tuple predicate is nonlocal dependent (ND) if the selection of the constraining 
data does depend on data in the constrained tuple. 

In figure 6-1, examples 1 and 4 involve L-type tuple predicates, example 2 is an NHype 
tuple predicate, and example 3 is an ND-type tuple predicate. 

This classification is in order of increasing complexity. For L-type tuple predicates, 
one has only to look at the constrained tuple to determine the restricting expression; the 
constraining data is present in the constrained tuple itself. For type-NI tuple predicates, 
this is no longer the case. The restricting expression is now computed from data arbitrarily 
located in the data base, not confined to the constrained tuple. However, the data from 
which the restricting expression is computed is the same for each tuple in fhe constrained 
collection. Thus the restricting expression admits of a one-time computation, with the result 
being used for each constrained tuple. For type-ND tuple predicates, the computation of 
the restricting expression depends on data in the constrained tuple. It is therefore necessary 
to recompute the restricting expression for each individual constrained tuple. 

There are two dimensions by which we classify local tuple predicates. The first 
dimension measures the complexity of the restricting expression, and has three levels: 

1. The restricted expression is compared via a scalar comparator to a constant, a single 
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column entry from the constrained tuple, or an expression involving several column 
entries from the constrained tuple, (types 1-3) 

2. The restricted expression is compared via a set comparator to a set of constants, a 
set of column entries from the constrained tuple, a set of single-valued expressions 
computed from entries from the constrained tuple, or some expression which yields a 
set of values and depends on entries in the constrained tuple, (types 4-7) 

3. The restricted expression is compared via a set comparator to a set of constant 
tuples, a set of tuples involving entries from the constrained tuple, a set of tuples 
composed of single-valued expressions computed from entries from the constrained 
tuple, or some expression which yields a set of tuples and depends on entries in the 
constrained tuple, (types 8-11) 

The second dimension reflects the complexity of the restricted expression, and also 
has three levels: 

a. For types 1-7, the restricted expression is a column entry in the constrained tuple. 
For types 8-11, it is a subtuple of the constrained tuple. 

b. The restricted expression is a single-valued expression. For types 1-7, the restricted 
expression is computed from column entries in the constrained tuple, and yields a 
scalar value. For types 8-11, it yields a tuple composed of such column entry 
expressions. 

c. The restricted expression is a set-valued expression. For types 4-7, it yields a set of 
scalars. For types 8-11, it yields a set of tuples. (This level does not apply to types 1-3.) 
Figure 6-2 illustrates this classification for local tuple predicates of types la-Ua. 

Consider the relation R (A. B, C, D, E, F) (where columns A, B, and C have underlying 
domain real number and columns D, E, and F have underlying domain character string). 
Some examples of local tuple predicates may be classified, as follows: 
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l.A<15(la), 
2.A<B(2a), 
S. A<B/C(3a). 

4. A is in {V, y, V} (4a), 

5. A is in {V, E, F) (6a) 

(This means that, for each constrained tuple, the entry in column A is in the set 
containing the constant "x" and the entries in columns £ and F.), 

6. (D, E) is in {(V, Y), ("z", F)} (10a) 

(This means that, for each constrained tuple, the subtuple consisting of the entries 
from columns D and E equals either the tuple (V,"y*), or a tuple whose first 
component is V and whose second component is the F entry of the constrained 
tuple.), 

7. A ♦ B < C (2b), 

8. A + B is in {C ♦ 1, C ♦ 2, C ♦ 3} (6b), 

9. {D, E} intersect {"w", "x"} contains {"y", V} (4c) 

(This means that the intersection of the sets consisting of the entries in columns D 
and E and the constants V and "x", is a superset of the set containing the constants 
yandV.). 

As for local tuple predicates, nonlocal tuple predicates may be classified on two 
dimensions. The first dimension again consists of three levels: 

1. The restricted expression is compared via a scalar comparator to a a single-valued 
expression, which yields a scalar value (and which is computed from data elsewhere 
in the data base), (type 1) 

2. The restricted expression is compared via a set comparator to a set-valued 
expression, which yields a set of scalars. (type 2) 
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3. The restricted expression ii compared via a set comparator ta a set-valued 
expression, which yields a set of tuples, (type 3) 
Again, the second dimension consists of three levels: 

a. For types 1-2, the restricted expression is a column entry. For type 3^ |t is a tuple 
of entries which constitutes a subtuple of the constrained tuple. 

b. The restricted expression is a single-valued expression. For types 1-2, this 
expression is computed from entries in the constrained tuple, and yields a scalar. For 
type 3, it yields a tuple composed of such column entry expressions. 

c The restricted expression is a set-valued expression. For type 2, it yields a set of 
scalars. For type 3, it yields a set of tuples. (ThU level does not apply to type L) 
Figure 6-3 illustrates this classification for nonlocal tuple predicates of types la-3a. 
Note that the computation of the restricting expression (scajarval or setval) is independent 
of the constrained tuple for Nl-type tuple predicates, but dependent for ND-type predicates. 
The data selection language must now serve the added role of identifying the constraining 
data. For this reason, the classification is coarser for nonlocal tuple predicates than for 
local tuple predicates. 

63.4. Set Assertions 

For set assertions, the constrained collection is a collection of tuple sets, obtained from 
the underlying relation, as discussed in section 6.2.2. The assertion predicate then applies to 
each tuple set in the constrained collection. Set predicates are used 3 specif y set assertions, 
The restricted expression is that aspect of each constrained tuple set that is being delimited. 
In the simplest case, the restricted expression is the set of entries in some column of the 
underlying relation (eg., the set of Salary entries in EMP). More generally, it m%y be an 
expression: an appropriate combination of column names, system-provided operators, and 
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user-defined operators. These operators include aggregate arithmetic operators which are 
applied to sets of values. 

As for tuple assertions, the restricting expression is the value that delimits the 
restricted expression. The constraining data may be, in general, data anywhere in the data 
base. Again, as for tuple assertions, it may be possible to express a given conceptual set 
assertion in several ways. 

Set predicates may be classified on the basis of the relationship between the 
constrained collection and the constraining data: 

1. A set predicate is local (L) if the constraining data is present in the constrained 
tuple set. That is, the restricting expression may be computed solely from the 
constrained tuple set. 

2. A set predicate is nonlocal independent (NI) if the constraining data is data 
selected from elsewhere in the data base, but where this selection does not depend 
upon the constrained tuple set. 

3. A set predicate is nonlocal dependent (ND> if the selection of the constraining data 
does depend upon the constrained tuple set. 

In figure 6-1, examples 6 and 8 are L-type set predicates, and examples 5 and 7 are Nl-type 
set predicates. 

As for tuple predicates, there are two dimensions on which local set predicates may be 
classified. One dimension reflects the complexity of the restricting expression, and the 
other reflects the complexity of the restricted expression. The first dimension has four 
levels: 

1. The restricted expression is compared via a scalar comparator to a constant, an 
aggregate function of the entries in some column of the constrained tuple set, or an 
expression involving several such aggregates, (types 1-3) 
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2. As in 1, except that the aggregate functions in the constraining expression are not 
computed for a set of scalars, but for a set of tuples; namely, the collection of 
subtuples obtained by projecting the constrained tuple set onto two or more columns, 
(types 4-6) 

3. The restricted expression is compared via a set comparator to a set of constants, the 
set of entries in some column of the constrained tuple set, or an expression involving 
several such sets, (types 7-9) 

4. This is analogous to 3 in the same way that 2 is analogous to 1. That is, the 
restricting expression does not deal with scalars, but with sets of subtuples of the 
constrained tuple set (types 10-12) 

The second dimension consists of two levels: 

a. For types 1-6, the restricted expression is an aggregate function. For types 7-12, it 
is an instantiation of the function "set", which generates the set of values in some 
column or the set of subtuples for some group of columns, taken over the constrained 
tuple set. 

b. For types 1-6, the restricted expression is a single-valued expression computed 
from two or more of the aggregate functions described above. For types 7-12, it is a 
set-valued expression, computed from two or more instantiations of "set", as described 
above. 

A special type of local set predicates, the column relationship predicates, are not 
included in the above scheme. Column relationship predicates are used to express 
properties such as one-to-one correspondences and functional dependencies. To state a 
column relationship predicate, two groups of column names from the constrained tuple set 
are specified. The relationship between these two groups of columns is then stated. For 
example, one may state that for the relation R (A, B, C, D, E, F), there is a one-to-one 



Semantic Integrity Specification 64 

correspondence between the column A and the column group (B, C). This means that there 
is a one-to-one relationship between the entry in column A and the subtuple formed from 
the entries in columns B and C. Note that column relationship predicates are always local. 

Figure 6-4 illustrates this classification for local set predicates, types la-16a. For 
example, for the relation R (A, B. C, D, E, F) (where columns A, B, and C have underlying 
domain real number and columns D, E, and F have underlying domain character string), 
various local set predicates may be classified, as follows: 

L avg(A) < 15 (la), 

2. avg(A) < sum(B) (2a), ' 

3. count(D, E) < 50 (4a) 

(This means that the number of tuples in the relation formed by projecting the 
constrained tuple set on columns D and E is less than 50.), 

4. set(D) contains fx", "y", V} (7a), 

5. set(D) properly contains set(E) union {"y M , V} (9a), 

6. set(D, E) is in {("w", V), ("y", V)} (LOa) 

(This means that the set of tuples obtained by projecting on columns D and E is a 
subset of the set of constant tuples containing ("w", "x") and ("y", V), 

7. D one-to-one (E, F) (14a), 

8. set (D) union set (E) is in set (F) (8b). 

Nonlocal set predicates may be similarly classified. The first dimension has three 
levels: 

1. The restricted expression is compared via a scalar comparator to a single-valued 
expression, which yields a scalar value (and which is computed from some data in the 
data base) (types 1-2). 

2. The restricted expression is compared via a set comparator to a set-valued 
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expression, whiqh jfielcU a set c^scaUrs, (type 3) 

3. The restricted expression is compared via a set comfsajrator jo i a set-valued, 

expression, which yields a set of tuples, (type 4) 

The second dimension consists of two levels: 

a. For types 1-2, the restricted exprewion fc an aggregate function. F«r types H. J* to 
an instantiation of the /unctioii "set*, which generate* the set *>f, valw#* in some 
column or the set of subtup'es for some fcrouookjcojbum^ the constrained 
tuple set 

b. For types 1-2, the restricted expression U a siogle-valued, ex preision computed 
from two or more of the aggrej^te functionj described above. For types 3-4, it is a 
set-valued expression, computed from two or f mQr| instantiations .of "set", as described 
abpve. 

Figure 6-5 illustrates this two dimensional dassif ication for. types Ja-4a, Mote that the 
computation of the restricting expression (scalarval or. jetyaDjs independent of the 
constrained tuple set for N|-type set predicates, but c^«idjs^|for NJ^D^|iredicates. 

6.2.5. Scope of Assertions 

It was stated in section 6.2.2 that each assertion is ac&iajly an assertion Kheina: an 
assertion is instantiated for and applies to each element of the constrained collection fcut 
there is another sense in which an assertion may be viewed as a schema. This is by 
allowing described rather than explicit references to relation and, $%^!in names within aij 
assertion. 

It may be desirable to state a "second order" assertion, e.g., each column in some 
relation of the data base which has underlying domain NAMf must be a, subset. of the 
Name column in relation EMP. This may be handled by* allowing column names (and, 
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relation names) to be variables which range over the set of all columns or relations in the 
data base (or some specified subset thereof). This is basically a universal quantification of 
second order. 

Without proposing a specific detailed solution to this problem of explicit scope vs. 
described scope, we may observe that such a solution must facilitate a second order 
quantification, on a level above the constrained collection. Consider the assertion that, for 
each column in the data base named Ci, every pair of entries in this column sums to less 
than 100. Here the constrained collection is a set of pairs of tuples. The property must hold 
for each element of the constrained collection. Furthermore, the assertion actually applies to 
each element in a set of constrained collections, viz, one such constrained collection for each 
column (in the data base) which is named CI. 

It has been stated that the scope of a relation constraint assertion can either be 
explicit (apply to relations and columns which are constants) or described (apply to relations 
and columns which are variables whose ranges are described). It is certainly valid to 
question the desirability and practicality of assertions with i described scope, and we shall not 
take a position on this matter here. Rather, for the purposes of the remainder of this 
thesis, it is sufficient to assume that we are dealing with assertions having explicit scope, 
although we believe that the extension to assertions having described scope is 
straightforward. 

6 S. Relation Constraint Validity Requirement 

Another component of a relation constraint is the validity requirement(s): the 
occasions) at which the assetion component of the constraint' must hold. 

One possibility is that an assertion must hold at all times, and consequently must be 
checked after any data base change that may cause its violation. Such assertions must 
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theoretically be checked (verified) after every primitive data base change (such as update, 
insert, or delete tuples Assertions actually need to be checked only if some value<$) are 
changed which may cause the assertion to be viohrted. Some success has been achieved in 
automatically determining when an assertion actually needs verification [Eswaran kITS; 
Stonebraker HYTScX 

In some cases, it is necessary to specify than an assertion need not hold during some 
complex data base tranuction(s), because it may not be meaningful to verify the assertion 
until after the transaction(s) are completed. Such assertions are checked only at the end of 
these transactions. 

Suppose, for example, that there is an assertion for the example data base of figure 1 
which states that exactly two employees in the sale* department have a salary of more than 
$15,000. Assume that at some tune the assertion holds, as employees "Smfth" and "Jones" 
both have salary t20,000 and work in the sales department It Is now desired to transfer 
employee "Smith* out of the sales department, replacing him with employee "Davis* (with 
salary $30,000). If the primitive operations update row, insert row; and delete row are the 
only operations available and the assertion is checked after each primitive operation, the 
desired change cannot be legally accomplished. Thus the verification of this assertion must 
be deferred until the entire transaction (which consists of two primitive operations) is 
completed. 

Consequently, it can be semantically necessary and/or desirable' for the constraint 
expressor to specify precisely when an assertion is to be checked. For reasons of efficiency, 
it is also important to have the ability to specify that an assertion need only be checked at 
certain limited times, because verifying ft after every data base change that could cause its 
violation might be catastrophicatly expensive. 

Accordingly, the validity requirement of a relation constraint should be expressed in 
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terms of structured operations. For example, the validity requirement of some assertions 
might be that the assertion is to be checked after operation raise-salary. Each relation 
constraint validity requirement should consist of a list of structured operations after which 
the assertion component is to be checked. The special validity requirement "always" has the 
function of assuring that the assertion will be checked after any data base change that may 
cause its violation. 

It may be necessary to check one or more relation constraint assertions after each data 
base change is attempted (by a structured operation). The simplest type of data base 
change is a primitive update, insert, or delete tuple operation. Slightly more complex is the 
set-oriented tuple update, insert, or delete which may be expressed in the high level 
nonprocedural data selection and modification language (e.g., SEQUEL). Since structured 
operations are hierarchically organized, it may be necessary to check some assertions after 
each hierarchic structured operation. Consider, for example, the structured operation A, 
which is defined to have the effect of executing a delete tuple operation, followed by the 
execution of operation B. Operation B consists of a single update tuple operation. It may 
then be necessary to check some assertions after the delete tuple operation, after operation 
B, after the update tuple operation (in B), and after operation A. 

A special treatment of "null" (undefined) values as column entries is required. As 
noted by Eswaran and Chamberlin [Eswaran 1975], the checking of a relation constraint 
assertion should be such that the presence of "null" values should never cause the assertion 
to succeed if it would otherwise fail (be violated), and should never cause it to fail if it 
would otherwise succeed. An exception to this rule is made for assertions which explicitly 
reference "null" values (e.g., "Sex - null"). 
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6.4. Relation Constraint Violation-Action 

Associated with every occasion at which an assertion is to be checked, is a violation- 
action to be taken if the assertion is not satisfied upon attempted verification. Several types 
of violation-action can be specified: 

L An error can be signalled, and the requeued data base change rejected. A message 

is issued informing the user of the problem; the nature of this message may be 

explicitly specified as a part of the violatfon-actjpo. or it may be chosen by the 

system. 

*• A warning can be issued, but the illegal data base change allowed. The user may 

be warned with a system-generated message;, or a menage specif led as part of the 

violation-action. The warning may be persistent, in which case it appears whenever 

the potentially bad data is referenced. 

s - A corrective action can be specified, which attempts to repair the error; the 

assertion is then rechecked. This approach may be dangerous, but is appropriate in 

some cases. There are several types of corrective action: 

a. a substitute value may be specified to replace the offending data, 

b. a structured operation may be performed, 
c an external procedure may be ailed. 

If a corrective violation-action is attempted, the relation constraint assertion which 
caused its invocation is rechecked after the corrective action is performed. It is 
intended that corrected value and structured operation corrective actions handle the 
bulk of the corrective violation-action needs. However, U is possible to call an 
external procedure (which is written in some high level general purpose 
programming language) as a corrective action. This external procedure receives no 
special privliges with regard to data base interaction. There are of course other 



Semantic Integrity Specification 70 

problems which result from permitting such external procedures to be used, which 

are similar to those dUcuised in the context of domain definition violation-action (see 

section 3.4). <A more far-reaching set of probtems of this type is discussed by Minsky 

[Minsky 19761) - - , 

The actual interface which reports relation constraint violations to the user should 

actually allow this user to control the violation-action. The user should be consulted, if 

appropriate. For instance, assume that the user Wishes to perform an operation which gives 

employee "Jones" a 10* raise in salary. Assume also that there is a relation constraint 

assertion which states that the •sum o¥ saiafies of all the employees in each department of 

the company must be less than the biic%etW%at : de^Siill ia ^ppbse also that this 

assertion would be violated if the salary df *jones*iS mcreased by 10*. A reasonable 

violation-action might be to raise the salary of m }oMl^t&mMmMk permissible value, 

while reporting this to the user and asking for approval before actually performing the 

action. 

In this scheme, the violation-actions are associated with the assertion; they are part 
of the relation constalnt This means that violation-action information U not a part of the 
specification of the structured operations. All information regarding the checking of an 
assertion is localized in the relation constraint. This has the desirable effect of eliminating 
the arbitrary procedural embedding of violation-action information. 

6.5. Implementation Considerations 

A relation constraint language processor may be used to "compile' 4 relation constraints 
into an internal form. Relation constraints may be added to and deleted from a data base. 
(A constraint may be changed by deleting it and adding a revised version.) Adding a 
relation constraint consists of its compilation and initial checking. Normally, the constraint 
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must be satisfied when it is added to the data base. 

The internal form into which a relation constraint is compiled is used by the semantic 
integrity subsystem to check the integrity of the data base, and to take appropriate action 
which violations are detected. Moreover, the integrity subsystem manages all four aspects 
of semantic integrity, as discussed above and in chapter 7. 

6.6. Remarks 

The principal purpose of this chapter has been to impose some structure on the 
problem of relation constraint specification in the context of the semantic integrity of a 
relational data base. Important issues to be considered in future work include: 

1. a detailed analysis of the applicability of specific high level nonprocedural data 
selection languages to assertion specification (e.g., SEQUEL, QUEL, or Query by 
Example), 

2. a complete description of a disciplined specification methodology for relation 
constraints (including detailed example(s) of relation constraint specification), 

3. specifications of the user interface of the semantic integrity subsystem, vis-a-vis 
relation constraints, 

4. an analysis of the impact of the semantic integrity subsystem on other aspects of 
the data base system (e.g., data security), 

5. an assessment of the ramifications of various problems concerning relation 
constraints, including: 

a. redundancies, 

b. contradictions, 

c. circularities (because of corrective action side effects), 

6. a study of implementation techniques for relation constraint checking. 
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7. ON THE DESIGN OF A SEMANTIC INTEGRITY SUBSYSTEM 

The purpose of this chapter is to present some brief comments on several important 
aspects of the design of a semantic integrity subsystem. The purpose of such a subsystem is 
to manage the semantic integrity of a data base, as indicated by the semantic integrity 
specifications for that data base. 

7J. Components of a Semantic Integrity Subsystem 

We propose that a semantic integrity subsystem possess four principal components: 

L The semantic integrity language processors translate the specifications in the high 
level semantic integrity languages into internal forms useful to the semantic integrity 
subsystem. As discussed in this thesis, there are four semantic integrity languages, for 
domain definition, relation structure, structured operations, and relation constraints. 
(Actually, these four langauges may be viewed as sublanguages of a single semantic 
integrity language.) s .. , - ,*^ : rt , . 

2. The semantic integrity checker determines which domain definitions and relation 
constraints need to be checked after a given data base Change is performed, and 
performs that checking. ' i 

3. The semantic integrity violation-action processor takes appropriate action when a 
domain definition or relation constraint is violated. 

4. The relation constraint compatibility checker is responsible for insuring that the set 
of relation constraints currently extant for a daft base is free from contradictions and 
other undesirable properties. The compatability checker may be called by the relation 
constraint language processor when adding a new relation constraint, to make sure 
that it is acceptable to add it. The problem of designing and implementing a 
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comparability checker involves general techniques of deductive inference, automated 
theorem provers, etc Only a very limited compatibility checker could be practical at 
the present time. 

7.2. The User's View of the Integrity Mechanism 

It is extremely important to provide an effective user - data base system interface, 
especially with regard to the creation, maintenance, and reporting of semantic integrity 
information. There are actually three major types of users with which one needs to be 
concerned: 

1. the data base administrator (DBA), which may in fact be a single person or many 
persons, whose job is to create and maintain the semantic integrity specifications, 

2. the nonprogramming user, who deals, with the data base by means of generalized 
data selection and modification languges (e.g. SEQUEL, QUEL, or Query by 
Example), 

3. the applications program, which calls upon data base system facilities. 

Of course, a single person may serve both as a DBA and a (nonprogramming) user. The 
distinction between nonprogramming users and applications program} is made in order to 
distinguish the types of communication with the semantic integrity subsystem which are 
necessary. 

The DBA should be provided facilities which allow the following types of actions: 

1. add relation, 

2. delete relation, 

3. add domain, 

4. delete domain, 

5. add structured operation, 
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6. delete structured operation, 

7. add relation constraint, 

8. delete relation constraint. 

It should also be possible for a DBA to change the structure of relations, and modify the 
definition of domains, structured operations, and relation constraints. It is furthermore 
desirable to allow the DBA to ask questions about the semantic integrity specifications, 
especially the relation constraints. For example, it should be possible to ask which 
constraints may possibly be violated if an entry in a given column is changed, or which 
constraints have a given column entry as constrained data. 

The nonprogramming user must be provided with high level reporting of semantic 
integrity violations and violation-actions. In general, a (nonprogramming) user sees a set of 
data structures (domains and relations), a set of structured operations, and a set of relation 
constraints. When a domain definition or relation constraint is found to be violated, the 
user is either informed of this fact or an automatic corrective action is attempted. In any 
case, it must be possible to provide the user with a high level 'error message". The 
semantic integrity subsystem must not be completely silent (eg., see [Stonebraker 1974d. 
Stohebraker 1975c]). It must also be possible for the user to interact with the semantic 
integrity subsystem to attempt to repair an error, should that be appropriate. 

The applications program must be provided with capabilities similar to those for 
nonprogramming users, but all communication must be accomplished via procedure call and 
return, and message passing protocols. 

7.3. Some Thoughts on Integrity Subsystem Implementation 

Although a detailed investigation of implementation techniques for semantic integrity 
subsystems is an important research topic, little has been done on it to date. Stonebraker 
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and Wong [Stonebraker 1974d, Stonebraker 1975c] have proposed a very clean "query 
modification 1 ' approach to integrity checking, hu$ Mus* scheme has some limitations (eg., 
some useful types of techniques for the optimization of integrity checking are not handled). 
Sarin [Sarin 1976] is currently investigating thu topic in some detail. Jn HhU thesis, we are 
not principally concerned with the specifics of implementation techniques. However, we 
shall discuss a few important aspects of semantic integrity subsystem implementation. 

First of all, it is important that a data base togging and backup facility exist This is 
crucial in allowing the actions of a structured operation (transaction) *o b* "hashed out" and 
"undone", if occasioned by the violation of a domain definitieftof relation constraint 

It is sometimes the case that a data base change wttUame leveral domain definitions 
and relation constraints to be checked. (A data base change is accomplished by the 
invocation of a primitive or structured operation) A scheme must be developed for 
determining in what order these are to be checked* One way to handle this is to assign 
priorities to domain definitions and relation constraints; this may be done by the DBA or 
automatically by the semantic integrity subsystem. Domain definitions should receive 
priority over relation constraints (since they are always checked after primitive operations), 
and the various types of relation constraints can be ordered by their complexity, importance, 
or some other metric. 

Since relation constraint checking is potentially a costly undertaking, it i* crucial that 
efficient checking techniques be developed. Much of the w«k on optimizing data selection 
and modifiction languages is relevant here. Heuristic* may be developed for determining, 
on the basis of the patterns of data base interaction, which access paths and aids to 
maintain [Hammer 1976b3. One type of useful heurtstte involves the maintenance of 
aggregate values. For example, if there is a relation constraint assertion which states that 
the sum of employee salaries is less than $100,000, it may be helpful to maintain the sum 
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and update it as necessary; rather than constantly recalculating it when the assertion is 
checked. Other types of heuristics may also prove usef ul, eg, dealing with characteristics 
of individual types of physical storage devices (such as data clustering and page 
arrangement), or dealing with the maintenance and useof Inversion* (indices). 

7.3.1. The Use of Inversions in Relation Constraint Checking (An Example) 

As an example illustrative of the usef ulnes* of inversions in relation constraint 
checking, consider an example assertion. Suppose that the assertion (for the example data 
base of figure 1-2) states that. for each tuple B in relation BrJDGET, the entry in the 
Salary_budget column {BJSalary .budget) is greater than or equal to the sum of the entries 
in the Salary column of the tuples in EMP (El, „., En) Which have Department - 
B.Department Several primitive operations which may require this assertion to be checked 
are listed below, along with the method by which the it&es**ry checking may be 
accomplished and an indication of which inversions would be heJpfulin such checking: 

1. for some tuple B in BUDGET, Salary JbudgettefihangedJ 

a. find all tuples in EMP (II, .„ En) which have Department - B.Department, 

b. calculate S - El.Salary ♦ „ «-En.Satary, 
c check that S <- B^alaryjbudget, 

useful inversions* Department in EMP (for step a), 

2. for some tuple E m EMP, Salary is changed: 

a. find all tuples ti\ EMP {Iky., En) which have Department - EDepartment, 

b. calculate S - ELSalary ♦ „. ♦ En5alary, 

c find the tuple in BUDGET (B) which has Department - E.Department, 
d. check that S <« B.SatairyJ>udget, 
useful inversions: Department in EMP (for step a), Department in BUDGET (for 
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step c), 

3. for some tuple in BUDGET (B), Department is changed: 
(same as I), 

4. for some tuple in EMP (E), Department is changed, 
(same as 2), 

5. a new tuple is inserted into BUDGET (B), 
(same as 1), 

6. a new tuple is inserted into EMP (E), 
(same as 2). 

In this particular example, no checking needs to be done when tuples are deleted from 
EMP, since that can only cause the sum (S) to decrease. Of course, this is not true for all 
assertions involving sums of this type 
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8. REMARKS AND DIRECTIONS 

The major purpose of this thesis has been to provide a comprehensive, detailed 
analysis of the issues and problems associated with maintaining semantic integrity in a 
generalized (relational) data base system. The principal emphasis has been on the high 
level expression of semantic integrity specifications. The major portion of the work 
described herein has been concerned with providing a framework for semantic integrity 
specifications. Both the functional requirements for a solution to the semantic integrity 
problem and a specific approach to providing such a solution have been emphasized. An 
attempt has been made to indicate important directions for further work on semantic 
integrity. 

By way of conclusion, there are several important general directions for the extension 
of the work described in this thesis. The following are most significant: 

1. an analysis of important integrity specification language design issues (e.g., the 
usefulness of constructs in languages like SEQUEL, QUEL, and Query by Example, 
the adequacy of nonprocedural specification methodologies, th* importance of 
iteration and recursion, etc), 

2. the complete design of a language for semantic integrity specification, including 
sublanguages for each of the four aspects of semantic integrity (in the relational data 
model), 

3. the development of a well-directed, structured, disciplined approach to data base 
design (based on the semantic integrity framework), 

4. a comprehensive example of the application of the semantic integrity specification 
methodology described herein to a "real" application domain, 

5. the implementation of the semantic integrity subsystem outlined in this thesis, 
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6. an analysis of the cost of building, maintaining, and enforcing semantic integrity 
rules, 

7. a study of the relationship of semantic integrity issues with those of security, 
concurrent consistency, and query processing Onchiding the use e# deductive 
techniques), 

8. an evaluation of the ramifications of separating the four aspects of integrity to the 
extent described above (e^ an analysis of whether U is.necessajy to allow the 
information within a domain definition to be referenced in relation constraint 
assertions), and a study of the appropriateness of this approach, 

9. an evaluation of the applicability of a behavioral approach to the description of 
data semantics in an integrated data base environment, 

10. the extension of the semantic integrity scheme to *ltow multiple "views" of a data 
base, 

11. an evaluation of possible extensions to permit ^ftonabsoj|ut|st approach to integrity 
(involving the notions of quantized truth and confidence measures [Zadeh 1975J), 

12. a study of the ability of the approach to the semantic integrity problem described 
in this thesis to improve the overall effectiveness of a data base system. 
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Figure 1-1. Relation EMP 



column -> Name 
under lying 

domain -> NAME 



Sex Salary 

SEX MONEY 



Manager Department 
NAME DEPT 



Jones, Richard male 812,033 Jones, Richard research 
Phillips, Jeff male 818,833 Smith, Kathy sales 

Smith, Kathy female 811,883 Jones, Richard sales 
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Figure 1-2. Example Data Base 

Domains: 

NAME QUAhi 

SEX ORDERJIUM 

MONEY CUST 

DEPT DATE 
ITEM 



Relations: 



EMP (Name, Sex, Salary. Manager, Department) 
NAME SEX .MONEY NAME DEPT 

SALES (Item, Department, Quant Uy^jonjiand, Cost) 
ITEM DEPT QUAN MONEY 

ORDERS (Order .number, Customer, Item, Date_ehipped) 
ORDER JJUM CUST ITEM OATE 

BUDGET (Department, Salary .budget) 
DEPT MONEY 
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Figure 1-3. A Possible Set of Relational Primitive Operations 



create domain 
delete domain 
create relation 
delete relation 

insert tuple 
delete tuple 
update tuple 

add column to 

relation 
delete column 

from relation 
copy relation 
intersection 
un i on 
di f ference 
join 



(these operations allow domains and 
relations to be defined and deleted) 



(these operations allou changes to be 
made to data in relations) 



(^hese operations facilitate relation 
mod I f I cat i on and rel at i ona I a I gebra i c 
manipulation of a data base) 
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Figure 3-1. Selected Example Data Base Domain Definitions 

("Smith, John") 



doma 


in NAME 

description 

last: string 
t • 

first: string 
ordering 

last, first 
violation-action 

error 






doma 


in SEX 
description 








oneof 'female', 


'ma 


le* 




ordering 








none 








violation-action 








error 'sex must 


be 


female or male' 



("female") 



domain HONEY ("8103") 

description 

•8' 

value: number where >-0 

where length(right(*. ',' +1)) - 2 
or not present *, '.' 
order i ng 

value 
violation-action 

substitute null 'value in error, null has been assumed' 

domain ITEM ("AB-75-326") 

description 

string where not has numerics, '-' 

II: '-' 

i2: string where not has alphabetics, '-' 

where repititions il through i2 >«1 and <-3 
or 

string uhere call checkjtem 
ordering 

call compare_item 
violation-action 

substitute left(*, 5) 
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Figure 3-1. (continued) 



domain QUAN (17) 

description 

value: number where integer 
and >-8 
order i ng 

atomic 
v i o I at i on-act i on 

call fixup.quan 

domain DATE ("1/20/1976") 

description 

month: oneof 1, ..., 12 
v . 

day: number uhere integer and >»1 and <»31 

V197' 

year: number uhere i ntegei* art* >*§ imd np4J 

uhere (if (month - 4 or »5 or «9 or -11) then day<-30) 
and ( i f month - 2 then day <- 29) 
and (if (month - 2 and year ~- 6) then day <- 28) 
order i ng 

year, month, day 
violation-action 

error 
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Figure 3-2. Syntax of the Domain Definition Language 

domain-definition tt- DOMAIN domain-name 

0E8@»f*nON 

description-clause 
[ORDERING 

orderlng-clausel 
[VIOLATIOMICTIOW 

v i o I at I on-act I on-c I ausel 

domain-name i:« string-conetant 

description-clause :i- doscription-subclauee 

| description-clause 
OR 
dee.cr i pt i on-eubc I auee 

deacription-subclauae it- description 

[uhere-restrtetion} 

description ::- [label:] subunit 
| description 
IiaDe4i] subunit 

label ::- string-conetant 

subunit *:- STRING OJHERE string-boolean J 
| NUMBER UHERE numbsr-beotmani 
| ONEOF string-constant-Jist 
| ONEOF number-constant- list 

strlng-constant*-4.iat u« string-cons tan t-component 

I string-constant-list, string-constant-component 

string-constant-component tt- string-constant 

| ALPHABETICS 
| NUMERICS 
| SPECIALS 

number-constant- 1 ist s:- number-constant 

| number-constant- list, number-constant 

string-boolean ::- string-boo I san- term ""■- 

| string boolean OR *tr lo g ■ b oo lean-term 

string-boo lean- term :»- string-boo lean-factor 

| string-boo loan-term AND atrfnfl-bod I man- factor 
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Figure 3-2. (continued) 

string-boo lean- factor ::- string-boo I ean-pr#maryr 

I NOT r • tr i ng-bool eanr-pr i nary 

etring-boo lean-primary ::- string-predicate 

I (strings-boolean) 

string-predicate u« comparator string-constant 

| IF etring-predicate THEN sir fenfHiejedl cats 

[ELSE string-predicate! 
I SIZE comparator nutob*r»e*prtes«ton 
I HAS string-constant- list 
I CALL procedure 

comparator »:- - | ~» | >| >« | < | <■ 

number-boolean it- nuaber-booJean-term 

I number-boolean OR number-boo lean- term 

number-boo lean- term »:- number-boo lean- factor 

| number-boolean-term Atfl number-boolean- fact or 

number-boolean-factor ::• number-boolean-primary 

I NOT number-boolean-primary 

number-boolean-primary :»« number-predicate 

| (number-boo lean) 

number-predicate it- comparator number-constant 

I IF nu wfaf - i nd icate THEN nusber -predicate 

(ELSE number -predicate) 
I INTEGER 
I EXPONENTIAL 
j CALL procedure 

uhere-reetriction ::- boolean 

boolean ir- boo lean- term 

I boolean OR boo lean- term 

boolean-term it- boo lean- factor 

I boo lean- term AND boo lean- factor 

boo lean- factor ti» boolean-primary 

| NOT boolean-primary 

boolean-primary »:» predicate 

I (boolean) 
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Figure 3-2. (continued) 

predicate ::» expression comparator expression 
| IF predicate THEN predicate 
ELSE predicate] 
| PRESENT expression, string-constant-list 
| CALL procedure 

expression :s- [addition-operator! unsigned-expression 

unsigned-expression ::- arithmetic-term 

| unsigned-expression addition-operator arithmetic-term 

arithmetic-term ::- arithmetic-factor 

| arithmetic-term mul tiply-operator arithmetic-factor 

ari thmetic- factor :s- subexpression 

| (expression) 

subexpression ::- atomic-expression 

| set-function (expression- I ist) 

| APPEND (express ion, expression) 

| SUBSTRING (expression, expression, expression) 

| LEFT (expression, expression) 

j RIGHT (expression, expression) 

j LOCATION (expression, expression) 

I LENGTH (express ion) 

| REPITITIONS label THROUGH label 

atomic-expression ::- label 

I string-conatant 
j number-constant 
I * 

express ion- 1 ist ::- expression 

| express ion- 1 ist, expression 

set-function ::- MAXIMUM | MAX | MINIMUM | MIN | string-constant 

addition-operator ::- + | - 

mul tiply-operator ::- # | / | ** 

order ing-clause ::- ordering- 1 ist 

| NONE 
| ATOMIC 
| CALL procedure 
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Figure 3-2. (continued) 



ordering-Mat ::- label 

| ordering- list, label 

violation-action-clause jj« violation-action 

| violation-action-clause 
violation-action 

violation-action »j» ERROR 

I ERROR Message 

I SUBSTITUTE expression 

I SUBSTITUTE expression 

I GAIL procedMre message 

message »i- string-constant 
I SVSTEfl-GENERATED 

procedure ::- string-constant 



Notes: 

The nonterminals string-constant and iwmber-conetant are not 
further defined. , ,; 

ALPHABETIC refers to the characters "A" through "Z" and "a" 
through "z", NUMERICS refers to the digits 8 throws* S* a*«i 
SPECIALS refers to alt other character** i .. 

SIZE returns the length of a string eubunit/ HAS si* •••♦ an 
returns "true" if a subunit has an occurrence of each of the 
strings si, .... en (otherwise "false"). SIZE and HAS appear 
only in subunit where restrictions. 

SUBSTRING (e* il, 12} returns the subeftr \m °t atripg a> e*art ing 
at character il and extending i2 characters. LEFT(s, 1) and 
RIGHT(s.i) return the left and right substring jtrespeetivejuj 
of s having length i. SUBSTRING, LEFT, and RIGHT may a I eo be 
invoked with a second argument which is a siring.* IW», means 
that the substring is to start at the leftmost or rightmost 
occurrence of the second string argument, e.g.* "LEFT**, *.»)" 
and H LEFT(*. INOEXf*. •.•))" are equivalent. LENGTH (a) returns 
the length of string s. APPEND(sl,s2) concatenate*** and s2. 
L0CATI0N(sl,s2) returns the index of the -,# Jret occurjfcence of 
s2 in si (or 8 if s2 is not a substring of si). REPETITIONS 
si THROUGH s2 returns the number of repetitions (of the domain 
value) for subunits labeled si through s2. 
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Figure 6-1. Soma Simple Assertions (for data base in figure 1-2) 

Note: CC weans constrained col I set ion, PR means predicate 

1. The salary of every employes is Isss than 850,880. 

CC: each tuple in EMP 
PR: Salary < 58888 

2. The manager of each employee is also an employee. 

CC: each tuple in EMP 

PR: Manager ie present in sst of e14 Hemes from tup lee 
In EMP 

3. The salary of each employee in the toy department is lees 
than the salary? of his manager i 

CC: each tuple in ,EMP uhere Department * 'toy' 
PR: Salary <Selery of the tuple where Neee - Manager 
in constrained tuple 

A. The ealary of an employee cannot dec r eaee. 
CC: each tuple in EMP 
PR: new Salary >- old Salary 

5. The average employee salary is at Isast equal to the salary 
of Robert Jones. 

CC: set of tuples in EMP 

PR: averageJSalary) >- Salary of tuj*Je Hher*>Name « 
•Jonae, Rmfcert' 

6. Each department has at meet two ampioyeee |4th a ealary of 
more than SS8.888. - b it- 

COt est of tuples in EMP uhere Baler y > 5800 8, grouped 

by common Department 
PRt count (Name) <- 2 

7. The number o*i f ema I e employees is at least 40* of the total 
number of eepioyaee. 

CC: set of tuples in ETP uhere Sex - 'female* 

PR: ■- count (Name) >» .4 * count (Name) fori tuptes ,ln EMP 

8. Employee namee are unique. 

CC: set of tuples in EfF 

PR: multiset (Name) has no duplicates 
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Figure 6-2. Local Tuple Predicates 



Types of Predicates (a): 

la. col scalarcomp const 
2a. col scalarcomp col 
3a. col scalarcomp colexpr 

4a. col setcomp Icons t-1, ..., conat-m) 

5a. col setcomp (col -1, ..., col -ml 

6a. col setcomp (colexpr-1, .... colexpr-m) 

7a. col setcomp setexpr 

8a. (col-1, .... col-n) setcomp Kconst-41, ...» eonst-ln), .... 

(const-ml, ..,,, coms*«*n)i 
9a. (col-1, ..., col-n* s*tco«p {(col-li% ..,, col -In), ..», 

(col -ml, ..., col-mn)) 
10a. (col-1, .... col-n) setcomp Kcolexpr-11, ..., colexpr-ln), .... 

(colexpr-ml, ..., col«Hpr*m«)J 
11a. (col-1, ..., col-n) setcomp setexpr 



Def ini tions: 
col: 

const: 
scalarop: 

setop: 

colexpr: 

setexpr: 
scalarcomp: 

setcomp: 



column name with optional "old" or "new" 

(co 1-1 , col -11 , etc. , are col s; a 1 1 co I » must 

reference entries within the constrain**?" tuple) 

constant from an appropriate domain 

+,-,*, •/, **, mm, min, *tfc., or a user -de fined 

scalar operator 

unton (slid written as f)>, intersect!©**, difference, 

or a user-defined set operator 

a legal combination of col, const, op, and setop which 

yields a single value 

same as colexpr except yields a set of values 

". *■, >, >-, <, <», or a user-defined scalar 

comparator 

is in, contains, properly is in, properly contains, 

or a user-defined set comparator 
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Figure 6-3. Nonlocal Tuple Predicatee 

Types of Predicates (a): 

la. col scalar coup sea larva I 

2a. col setcoap setval 

3a. (col-1, .... col-n) eetcomp setval 

(In type 2a setval is a set of values, and in type 3a setval 
ie a set of tuples.) 

Definitions: 

Definitions hsrs are ths sane ae figure 6-2, except t 

sea larva I: a scalar value coaputod froa the data base 
eetvah a eet value computed froa th* data base 

NO predicates are the ease as Nl predicatee, except that the 
process selecting sea larva I and sstval nay reference the entries 
in the constrained tuple. 
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Figure 6-4. Local Set Predicates 



Types of Predicates (a): 

la. aggfn(col) scalarcomp const 

2a. aggfn(col) scalarcomp aggfn(col) 

3a. aggfn(col) scalarcomp aggfnexpr 

4a. aggfn (col-1, .... col-n) scalarcomp const 

5a. aggfn (col-1, . .., col-n) scalarcomp aggfn(col-l, .,., col-m) 

8a. aggfn(col-l, ..., col-n) scalarcomp aggfnexpr 

7a. set (col) setcomp (const-1, ..., const-n) 

8a. set (col) setcomp set (col) 

9a. set (col) setcomp setfnexpr 



10a. 8et(col-l, . 

(const-ml 
11a. set (co 1-1, . 

(co I -ml , 
12a. set (co 1-1, . 



., col-n) setcomp Uconst-11, .... const-ln), 

■*..-, const-mn)) 

. , col -«ri) setcomp Koe4-4i,- ».», coMln> r ... 

..» col -ran)) 

. , col-n) setcomp setfnexpr 



■ • » 



13a. col ere I col 

14a. col crel (col-1, .... col-m) 

15a. (col-1, ..., col-n) crel col 

18a. (col-1 col-n) crel (col-1, ..., col-m) 



Def ini tions: 



(col, const, scalarop, setop, colexpr, scalarcomp, setcomp are as 

in figure 6-2) 

aggfn: set, max, min, avg, sum, count, or a user-defined 

aggregate function (also all these with " ,H , e.g., 

"set"', meaning duplicates are retained) 
crel: one-to-one, functionally-dependent, or a user-defined 

column relationship comparator 
aggfnexpr: a legal combination of aggfn, col, const, scalarop, setop, 

and colexpr 
setfnexpr: a legal combination of "set", col, const, scalarop, setop, 

and colexpr 

"Set" returns the set of values in a column (or tuples in a group 
of columns. It is an aggfn, but is also treated separately since 
it yields a set value. 
(Note that "max (set (Salary))" is equivalent to "max (Salary)".) 
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Figure 6-5. Nonlocal Set Predicates 

Types of Predicates (a): 

la. aggfn(col) scalarcomp sea larva I 

2a. aggfn(col-l, .... col-n) scalarcomp sea larva I 

3a. 8et(col) setcomp setval 

4a. set (co 1-1, ..., col-n) setcomp setval 

(In type 3a, setval is a set of scalers, and in type 4a, setval 
is a set of tuples.) 

Definitions: 

Definitions here are the same as figure 6-4, except: 

sea larva It a scalar value computed from the data base 
setval: a set value computed from the data base 

NO predicates are the same as NI predicates, except that the 
process selecting sea larva I and setval may reference the data in 
the conetrained tuple set. 
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